\input texinfo @c -*-texinfo-*-
@c $NetBSD: awk.texi,v 1.1 2004/03/27 11:34:11 jdolecek Exp $
@c %**start of header (This is for running Texinfo on a region.)
@setfilename awk.info
@settitle The GNU Awk User's Guide
@c %**end of header (This is for running Texinfo on a region.)

@dircategory Text creation and manipulation
@direntry
* Gawk: (awk). A text scanning and processing language.
@end direntry
@dircategory Individual utilities
@direntry
* awk: (awk)Invoking gawk. Text scanning and processing.
@end direntry

@set xref-automatic-section-title

@c The following information should be updated here only!
@c This sets the edition of the document, the version of gawk it
@c applies to and all the info about who's publishing this edition

@c These apply across the board.
@set UPDATE-MONTH June, 2003
@set VERSION 3.1
@set PATCHLEVEL 3

@set FSF

@set TITLE GAWK: Effective AWK Programming
@set SUBTITLE A User's Guide for GNU Awk
@set EDITION 3

@iftex
@set DOCUMENT book
@set CHAPTER chapter
@set APPENDIX appendix
@set SECTION section
@set SUBSECTION subsection
@set DARKCORNER @inmargin{@image{lflashlight,1cm}, @image{rflashlight,1cm}}
@end iftex
@ifinfo
@set DOCUMENT Info file
@set CHAPTER major node
@set APPENDIX major node
@set SECTION minor node
@set SUBSECTION node
@set DARKCORNER (d.c.)
@end ifinfo
@ifhtml
@set DOCUMENT Web page
@set CHAPTER chapter
@set APPENDIX appendix
@set SECTION section
@set SUBSECTION subsection
@set DARKCORNER (d.c.)
@end ifhtml
@ifxml
@set DOCUMENT book
@set CHAPTER chapter
@set APPENDIX appendix
@set SECTION section
@set SUBSECTION subsection
@set DARKCORNER (d.c.)
@end ifxml

@c some special symbols
@iftex
@set LEQ @math{@leq}
@end iftex
@ifnottex
@set LEQ <=
@end ifnottex

@set FN file name
@set FFN File Name
@set DF data file
@set DDF Data File
@set PVERSION version
@set CTL Ctrl

@ignore
Some comments on the layout for TeX.
1. Use at least texinfo.tex 2000-09-06.09
2. I have done A LOT of work to make this look good. There are `@page' commands
and use of `@group ... @end group' in a number of places. If you muck
with anything, it's your responsibility not to break the layout.
@end ignore

@c merge the function and variable indexes into the concept index
@ifinfo
@synindex fn cp
@synindex vr cp
@end ifinfo
@iftex
@syncodeindex fn cp
@syncodeindex vr cp
@end iftex
@ifxml
@syncodeindex fn cp
@syncodeindex vr cp
@end ifxml

@c If "finalout" is commented out, the printed output will show
@c black boxes that mark lines that are too long. Thus, it is
@c unwise to comment it out when running a master in case there are
@c overfulls which are deemed okay.

@iftex
@finalout
@end iftex

@copying
Copyright @copyright{} 1989, 1991, 1992, 1993, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003 Free Software Foundation, Inc.
@sp 2

This is Edition @value{EDITION} of @cite{@value{TITLE}: @value{SUBTITLE}},
for the @value{VERSION}.@value{PATCHLEVEL} (or later) version of the GNU
implementation of AWK.

Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.2 or
any later version published by the Free Software Foundation; with the
Invariant Sections being ``GNU General Public License'', the Front-Cover
texts being (a) (see below), and with the Back-Cover Texts being (b)
(see below). A copy of the license is included in the section entitled
``GNU Free Documentation License''.

@enumerate a
@item
``A GNU Manual''

@item
``You have freedom to copy and modify this GNU Manual, like GNU
software. Copies published by the Free Software Foundation raise
funds for GNU development.''
@end enumerate
@end copying

@c Comment out the "smallbook" for technical review. Saves
@c considerable paper. Remember to turn it back on *before*
@c starting the page-breaking work.

@c 4/2002: Karl Berry recommends commenting out this and the
@c `@setchapternewpage odd', and letting users use `texi2dvi -t'
@c if they want to waste paper.
@c @smallbook


@c Uncomment this for the release. Leaving it off saves paper
@c during editing and review.
@c @setchapternewpage odd

@titlepage
@title @value{TITLE}
@subtitle @value{SUBTITLE}
@subtitle Edition @value{EDITION}
@subtitle @value{UPDATE-MONTH}
@author Arnold D. Robbins

@c Include the Distribution inside the titlepage environment so
@c that headings are turned off. Headings on and off do not work.

@page
@vskip 0pt plus 1filll
@ignore
The programs and applications presented in this book have been
included for their instructional value. They have been tested with care
but are not guaranteed for any particular purpose. The publisher does not
offer any warranties or representations, nor does it accept any
liabilities with respect to the programs or applications.
So there.
@sp 2
UNIX is a registered trademark of The Open Group in the United States and other countries. @*
Microsoft, MS and MS-DOS are registered trademarks, and Windows is a
trademark of Microsoft Corporation in the United States and other
countries. @*
Atari, 520ST, 1040ST, TT, STE, Mega and Falcon are registered trademarks
or trademarks of Atari Corporation. @*
DEC, Digital, OpenVMS, ULTRIX and VMS are trademarks of Digital Equipment
Corporation. @*
@end ignore
``To boldly go where no man has gone before'' is a
Registered Trademark of Paramount Pictures Corporation. @*
@c sorry, i couldn't resist
@sp 3
Published by:
@sp 1

Free Software Foundation @*
59 Temple Place --- Suite 330 @*
Boston, MA 02111-1307 USA @*
Phone: +1-617-542-5942 @*
Fax: +1-617-542-2652 @*
Email: @email{gnu@@gnu.org} @*
URL: @uref{http://www.gnu.org/} @*

@c This one is correct for gawk 3.1.0 from the FSF
ISBN 1-882114-28-0 @*
@sp 2
@insertcopying
@sp 2
Cover art by Etienne Suvasa.
@end titlepage

@c Thanks to Bob Chassell for directions on doing dedications.
@iftex
@headings off
@page
@w{ }
@sp 9
@center @i{To Miriam, for making me complete.}
@sp 1
@center @i{To Chana, for the joy you bring us.}
@sp 1
@center @i{To Rivka, for the exponential increase.}
@sp 1
@center @i{To Nachum, for the added dimension.}
@sp 1
@center @i{To Malka, for the new beginning.}
@w{ }
@page
@w{ }
@page
@headings on
@end iftex

@iftex
@headings off
@evenheading @thispage@ @ @ @strong{@value{TITLE}} @| @|
@oddheading @| @| @strong{@thischapter}@ @ @ @thispage
@end iftex

@ifnottex
@ifnotxml
@node Top
@top General Introduction
@c Preface node should come right after the Top
@c node, in `unnumbered' sections, then the chapter, `What is gawk'.
@c Licensing nodes are appendices, they're not central to AWK.

This file documents @command{awk}, a program that you can use to select
particular records in a file and perform operations upon them.

@insertcopying

@end ifnotxml
@end ifnottex

@menu
* Foreword:: Some nice words about this
@value{DOCUMENT}.
* Preface:: What this @value{DOCUMENT} is about; brief
history and acknowledgments.
* Getting Started:: A basic introduction to using
@command{awk}. How to run an @command{awk}
program. Command-line syntax.
* Regexp:: All about matching things using regular
expressions.
* Reading Files:: How to read files and manipulate fields.
* Printing:: How to print using @command{awk}. Describes
the @code{print} and @code{printf}
statements. Also describes redirection of
output.
* Expressions:: Expressions are the basic building blocks
of statements.
* Patterns and Actions:: Overviews of patterns and actions.
* Arrays:: The description and use of arrays. Also
includes array-oriented control statements.
* Functions:: Built-in and user-defined functions.
* Internationalization:: Getting @command{gawk} to speak your
language.
* Advanced Features:: Stuff for advanced users, specific to
@command{gawk}.
* Invoking Gawk:: How to run @command{gawk}.
* Library Functions:: A Library of @command{awk} Functions.
* Sample Programs:: Many @command{awk} programs with complete
explanations.
* Language History:: The evolution of the @command{awk}
language.
* Installation:: Installing @command{gawk} under various
operating systems.
* Notes:: Notes about @command{gawk} extensions and
possible future work.
* Basic Concepts:: A very quick introduction to programming
concepts.
* Glossary:: An explanation of some unfamiliar terms.
* Copying:: Your right to copy and distribute
@command{gawk}.
* GNU Free Documentation License:: The license for this @value{DOCUMENT}.
* Index:: Concept and Variable Index.

@detailmenu
* History:: The history of @command{gawk} and
@command{awk}.
* Names:: What name to use to find @command{awk}.
* This Manual:: Using this @value{DOCUMENT}. Includes
sample input files that you can use.
* Conventions:: Typographical Conventions.
* Manual History:: Brief history of the GNU project and this
@value{DOCUMENT}.
* How To Contribute:: Helping to save the world.
* Acknowledgments:: Acknowledgments.
* Running gawk:: How to run @command{gawk} programs;
includes command-line syntax.
* One-shot:: Running a short throwaway @command{awk}
program.
* Read Terminal:: Using no input files (input from terminal
instead).
* Long:: Putting permanent @command{awk} programs in
files.
* Executable Scripts:: Making self-contained @command{awk}
programs.
* Comments:: Adding documentation to @command{gawk}
programs.
* Quoting:: More discussion of shell quoting issues.
* Sample Data Files:: Sample data files for use in the
@command{awk} programs illustrated in this
@value{DOCUMENT}.
* Very Simple:: A very simple example.
* Two Rules:: A less simple one-line example using two
rules.
* More Complex:: A more complex example.
* Statements/Lines:: Subdividing or combining statements into
lines.
* Other Features:: Other Features of @command{awk}.
* When:: When to use @command{gawk} and when to use
other things.
* Regexp Usage:: How to Use Regular Expressions.
* Escape Sequences:: How to write nonprinting characters.
* Regexp Operators:: Regular Expression Operators.
* Character Lists:: What can go between @samp{[...]}.
* GNU Regexp Operators:: Operators specific to GNU software.
* Case-sensitivity:: How to do case-insensitive matching.
* Leftmost Longest:: How much text matches.
* Computed Regexps:: Using Dynamic Regexps.
* Locales:: How the locale affects things.
* Records:: Controlling how data is split into records.
* Fields:: An introduction to fields.
* Nonconstant Fields:: Nonconstant Field Numbers.
* Changing Fields:: Changing the Contents of a Field.
* Field Separators:: The field separator and how to change it.
* Regexp Field Splitting:: Using regexps as the field separator.
* Single Character Fields:: Making each character a separate field.
* Command Line Field Separator:: Setting @code{FS} from the command-line.
* Field Splitting Summary:: Some final points and a summary table.
* Constant Size:: Reading constant width data.
* Multiple Line:: Reading multi-line records.
* Getline:: Reading files under explicit program
control using the @code{getline} function.
* Plain Getline:: Using @code{getline} with no arguments.
* Getline/Variable:: Using @code{getline} into a variable.
* Getline/File:: Using @code{getline} from a file.
* Getline/Variable/File:: Using @code{getline} into a variable from a
file.
* Getline/Pipe:: Using @code{getline} from a pipe.
* Getline/Variable/Pipe:: Using @code{getline} into a variable from a
pipe.
* Getline/Coprocess:: Using @code{getline} from a coprocess.
* Getline/Variable/Coprocess:: Using @code{getline} into a variable from a
coprocess.
* Getline Notes:: Important things to know about
@code{getline}.
* Getline Summary:: Summary of @code{getline} Variants.
* Print:: The @code{print} statement.
* Print Examples:: Simple examples of @code{print} statements.
* Output Separators:: The output separators and how to change
them.
* OFMT:: Controlling Numeric Output With
@code{print}.
* Printf:: The @code{printf} statement.
* Basic Printf:: Syntax of the @code{printf} statement.
* Control Letters:: Format-control letters.
* Format Modifiers:: Format-specification modifiers.
* Printf Examples:: Several examples.
* Redirection:: How to redirect output to multiple files
and pipes.
* Special Files:: File name interpretation in @command{gawk}.
@command{gawk} allows access to inherited
file descriptors.
* Special FD:: Special files for I/O.
* Special Process:: Special files for process information.
* Special Network:: Special files for network communications.
* Special Caveats:: Things to watch out for.
* Close Files And Pipes:: Closing Input and Output Files and Pipes.
* Constants:: String, numeric and regexp constants.
* Scalar Constants:: Numeric and string constants.
* Nondecimal-numbers:: What are octal and hex numbers.
* Regexp Constants:: Regular Expression constants.
* Using Constant Regexps:: When and how to use a regexp constant.
* Variables:: Variables give names to values for later
use.
* Using Variables:: Using variables in your programs.
* Assignment Options:: Setting variables on the command-line and a
summary of command-line syntax. This is an
advanced method of input.
* Conversion:: The conversion of strings to numbers and
vice versa.
* Arithmetic Ops:: Arithmetic operations (@samp{+}, @samp{-},
etc.)
* Concatenation:: Concatenating strings.
* Assignment Ops:: Changing the value of a variable or a
field.
* Increment Ops:: Incrementing the numeric value of a
variable.
* Truth Values:: What is ``true'' and what is ``false''.
* Typing and Comparison:: How variables acquire types and how this
affects comparison of numbers and strings
with @samp{<}, etc.
* Boolean Ops:: Combining comparison expressions using
boolean operators @samp{||} (``or''),
@samp{&&} (``and'') and @samp{!} (``not'').
* Conditional Exp:: Conditional expressions select between two
subexpressions under control of a third
subexpression.
* Function Calls:: A function call is an expression.
* Precedence:: How various operators nest.
* Pattern Overview:: What goes into a pattern.
* Regexp Patterns:: Using regexps as patterns.
* Expression Patterns:: Any expression can be used as a pattern.
* Ranges:: Pairs of patterns specify record ranges.
* BEGIN/END:: Specifying initialization and cleanup
rules.
* Using BEGIN/END:: How and why to use BEGIN/END rules.
* I/O And BEGIN/END:: I/O issues in BEGIN/END rules.
* Empty:: The empty pattern, which matches every
record.
* Using Shell Variables:: How to use shell variables with
@command{awk}.
* Action Overview:: What goes into an action.
* Statements:: Describes the various control statements in
detail.
* If Statement:: Conditionally execute some @command{awk}
statements.
* While Statement:: Loop until some condition is satisfied.
* Do Statement:: Do specified action while looping until
some condition is satisfied.
* For Statement:: Another looping statement, that provides
initialization and increment clauses.
* Switch Statement:: Switch/case evaluation for conditional
execution of statements based on a value.
* Break Statement:: Immediately exit the innermost enclosing
loop.
* Continue Statement:: Skip to the end of the innermost enclosing
loop.
* Next Statement:: Stop processing the current input record.
* Nextfile Statement:: Stop processing the current file.
* Exit Statement:: Stop execution of @command{awk}.
* Built-in Variables:: Summarizes the built-in variables.
* User-modified:: Built-in variables that you change to
control @command{awk}.
* Auto-set:: Built-in variables where @command{awk}
gives you information.
* ARGC and ARGV:: Ways to use @code{ARGC} and @code{ARGV}.
* Array Intro:: Introduction to Arrays
* Reference to Elements:: How to examine one element of an array.
* Assigning Elements:: How to change an element of an array.
* Array Example:: Basic Example of an Array
* Scanning an Array:: A variation of the @code{for} statement. It
loops through the indices of an array's
existing elements.
* Delete:: The @code{delete} statement removes an
element from an array.
* Numeric Array Subscripts:: How to use numbers as subscripts in
@command{awk}.
* Uninitialized Subscripts:: Using Uninitialized variables as
subscripts.
* Multi-dimensional:: Emulating multidimensional arrays in
@command{awk}.
* Multi-scanning:: Scanning multidimensional arrays.
* Array Sorting:: Sorting array values and indices.
* Built-in:: Summarizes the built-in functions.
* Calling Built-in:: How to call built-in functions.
* Numeric Functions:: Functions that work with numbers, including
@code{int}, @code{sin} and @code{rand}.
* String Functions:: Functions for string manipulation, such as
@code{split}, @code{match} and
@code{sprintf}.
* Gory Details:: More than you want to know about @samp{\}
and @samp{&} with @code{sub}, @code{gsub},
and @code{gensub}.
* I/O Functions:: Functions for files and shell commands.
* Time Functions:: Functions for dealing with timestamps.
* Bitwise Functions:: Functions for bitwise operations.
* I18N Functions:: Functions for string translation.
* User-defined:: Describes User-defined functions in detail.
* Definition Syntax:: How to write definitions and what they
mean.
* Function Example:: An example function definition and what it
does.
* Function Caveats:: Things to watch out for.
* Return Statement:: Specifying the value a function returns.
* Dynamic Typing:: How variable types can change at runtime.
* I18N and L10N:: Internationalization and Localization.
* Explaining gettext:: How GNU @code{gettext} works.
* Programmer i18n:: Features for the programmer.
* Translator i18n:: Features for the translator.
* String Extraction:: Extracting marked strings.
* Printf Ordering:: Rearranging @code{printf} arguments.
* I18N Portability:: @command{awk}-level portability issues.
* I18N Example:: A simple i18n example.
* Gawk I18N:: @command{gawk} is also internationalized.
* Nondecimal Data:: Allowing nondecimal input data.
* Two-way I/O:: Two-way communications with another
process.
* TCP/IP Networking:: Using @command{gawk} for network
programming.
* Portal Files:: Using @command{gawk} with BSD portals.
* Profiling:: Profiling your @command{awk} programs.
* Command Line:: How to run @command{awk}.
* Options:: Command-line options and their meanings.
* Other Arguments:: Input file names and variable assignments.
* AWKPATH Variable:: Searching directories for @command{awk}
programs.
* Obsolete:: Obsolete Options and/or features.
* Undocumented:: Undocumented Options and Features.
* Known Bugs:: Known Bugs in @command{gawk}.
* Library Names:: How to best name private global variables
in library functions.
* General Functions:: Functions that are of general use.
* Nextfile Function:: Two implementations of a @code{nextfile}
function.
* Assert Function:: A function for assertions in @command{awk}
programs.
* Round Function:: A function for rounding if @code{sprintf}
does not do it correctly.
* Cliff Random Function:: The Cliff Random Number Generator.
* Ordinal Functions:: Functions for using characters as numbers
and vice versa.
* Join Function:: A function to join an array into a string.
* Gettimeofday Function:: A function to get formatted times.
* Data File Management:: Functions for managing command-line data
files.
* Filetrans Function:: A function for handling data file
transitions.
* Rewind Function:: A function for rereading the current file.
* File Checking:: Checking that data files are readable.
* Empty Files:: Checking for zero-length files.
* Ignoring Assigns:: Treating assignments as file names.
* Getopt Function:: A function for processing command-line
arguments.
* Passwd Functions:: Functions for getting user information.
* Group Functions:: Functions for getting group information.
* Running Examples:: How to run these examples.
* Clones:: Clones of common utilities.
* Cut Program:: The @command{cut} utility.
* Egrep Program:: The @command{egrep} utility.
* Id Program:: The @command{id} utility.
* Split Program:: The @command{split} utility.
* Tee Program:: The @command{tee} utility.
* Uniq Program:: The @command{uniq} utility.
* Wc Program:: The @command{wc} utility.
* Miscellaneous Programs:: Some interesting @command{awk} programs.
* Dupword Program:: Finding duplicated words in a document.
* Alarm Program:: An alarm clock.
* Translate Program:: A program similar to the @command{tr}
utility.
* Labels Program:: Printing mailing labels.
* Word Sorting:: A program to produce a word usage count.
* History Sorting:: Eliminating duplicate entries from a
history file.
* Extract Program:: Pulling out programs from Texinfo source
files.
* Simple Sed:: A Simple Stream Editor.
* Igawk Program:: A wrapper for @command{awk} that includes
files.
* V7/SVR3.1:: The major changes between V7 and System V
Release 3.1.
* SVR4:: Minor changes between System V Releases 3.1
and 4.
* POSIX:: New features from the POSIX standard.
* BTL:: New features from the Bell Laboratories
version of @command{awk}.
* POSIX/GNU:: The extensions in @command{gawk} not in
POSIX @command{awk}.
* Contributors:: The major contributors to @command{gawk}.
* Gawk Distribution:: What is in the @command{gawk} distribution.
* Getting:: How to get the distribution.
* Extracting:: How to extract the distribution.
* Distribution contents:: What is in the distribution.
* Unix Installation:: Installing @command{gawk} under various
versions of Unix.
* Quick Installation:: Compiling @command{gawk} under Unix.
* Additional Configuration Options:: Other compile-time options.
* Configuration Philosophy:: How it's all supposed to work.
* Non-Unix Installation:: Installation on Other Operating Systems.
* Amiga Installation:: Installing @command{gawk} on an Amiga.
* BeOS Installation:: Installing @command{gawk} on BeOS.
* PC Installation:: Installing and Compiling @command{gawk} on
MS-DOS and OS/2.
* PC Binary Installation:: Installing a prepared distribution.
* PC Compiling:: Compiling @command{gawk} for MS-DOS, Windows32,
and OS/2.
* PC Using:: Running @command{gawk} on MS-DOS, Windows32 and
OS/2.
* PC Dynamic:: Compiling @command{gawk} for dynamic
libraries.
* Cygwin:: Building and running @command{gawk} for
Cygwin.
* VMS Installation:: Installing @command{gawk} on VMS.
* VMS Compilation:: How to compile @command{gawk} under VMS.
* VMS Installation Details:: How to install @command{gawk} under VMS.
* VMS Running:: How to run @command{gawk} under VMS.
* VMS POSIX:: Alternate instructions for VMS POSIX.
* Unsupported:: Systems whose ports are no longer
supported.
* Atari Installation:: Installing @command{gawk} on the Atari ST.
* Atari Compiling:: Compiling @command{gawk} on Atari.
* Atari Using:: Running @command{gawk} on Atari.
* Tandem Installation:: Installing @command{gawk} on a Tandem.
* Bugs:: Reporting Problems and Bugs.
* Other Versions:: Other freely available @command{awk}
implementations.
* Compatibility Mode:: How to disable certain @command{gawk}
extensions.
* Additions:: Making Additions To @command{gawk}.
* Adding Code:: Adding code to the main body of
@command{gawk}.
* New Ports:: Porting @command{gawk} to a new operating
system.
* Dynamic Extensions:: Adding new built-in functions to
@command{gawk}.
* Internals:: A brief look at some @command{gawk}
internals.
* Sample Library:: An example of new functions.
* Internal File Description:: What the new functions will do.
* Internal File Ops:: The code for internal file operations.
* Using Internal File Ops:: How to use an external extension.
* Future Extensions:: New features that may be implemented one
day.
* Basic High Level:: The high level view.
* Basic Data Typing:: A very quick intro to data types.
* Floating Point Issues:: Stuff to know about floating-point numbers.
@end detailmenu
@end menu

@c dedication for Info file
@ifinfo
@center To Miriam, for making me complete.
@sp 1
@center To Chana, for the joy you bring us.
@sp 1
@center To Rivka, for the exponential increase.
@sp 1
@center To Nachum, for the added dimension.
@sp 1
@center To Malka, for the new beginning.
@end ifinfo

@summarycontents
@contents

@node Foreword
@unnumbered Foreword

Arnold Robbins and I are good friends. We were introduced 11 years ago
by circumstances---and our favorite programming language, AWK.
The circumstances started a couple of years
earlier. I was working at a new job and noticed an unplugged
Unix computer sitting in the corner. No one knew how to use it,
and neither did I. However,
a couple of days later it was running, and
I was @code{root} and the one-and-only user.
That day, I began the transition from statistician to Unix programmer.

On one of many trips to the library or bookstore in search of
books on Unix, I found the gray AWK book, a.k.a. Aho, Kernighan and
Weinberger, @cite{The AWK Programming Language}, Addison-Wesley,
1988. AWK's simple programming paradigm---find a pattern in the
input and then perform an action---often reduced complex or tedious
data manipulations to a few lines of code. I was excited to try my
hand at programming in AWK.

Alas, the @command{awk} on my computer was a limited version of the
language described in the AWK book. I discovered that my computer
had ``old @command{awk}'' and the AWK book described ``new @command{awk}.''
I learned that this was typical; the old version refused to step
aside or relinquish its name. If a system had a new @command{awk}, it was
invariably called @command{nawk}, and few systems had it.
The best way to get a new @command{awk} was to @command{ftp} the source code for
@command{gawk} from @code{prep.ai.mit.edu}. @command{gawk} was a version of
new @command{awk} written by David Trueman and Arnold, and available under
the GNU General Public License.

(Incidentally,
it's no longer difficult to find a new @command{awk}. @command{gawk} ships with
Linux, and you can download binaries or source code for almost
any system; my wife uses @command{gawk} on her VMS box.)

My Unix system started out unplugged from the wall; it certainly was not
plugged into a network. So, oblivious to the existence of @command{gawk}
and the Unix community in general, and desiring a new @command{awk}, I wrote
my own, called @command{mawk}.
Before I was finished I knew about @command{gawk},
but it was too late to stop, so I eventually posted
to a @code{comp.sources} newsgroup.

A few days after my posting, I got a friendly email
from Arnold introducing
himself. He suggested we share design and algorithms and
attached a draft of the POSIX standard so
that I could update @command{mawk} to support language extensions added
after publication of the AWK book.

Frankly, if our roles had
been reversed, I would not have been so open and we probably would
have never met. I'm glad we did meet.
He is an AWK expert's AWK expert and a genuinely nice person.
Arnold contributes significant amounts of his
expertise and time to the Free Software Foundation.

This book is the @command{gawk} reference manual, but at its core it
is a book about AWK programming that
will appeal to a wide audience.
It is a definitive reference to the AWK language as defined by the
1987 Bell Labs release and codified in the 1992 POSIX Utilities
standard.

On the other hand, the novice AWK programmer can study
a wealth of practical programs that emphasize
the power of AWK's basic idioms:
data-driven control flow, pattern matching with regular expressions,
and associative arrays.
Those looking for something new can try out @command{gawk}'s
interface to network protocols via special @file{/inet} files.

The programs in this book make clear that an AWK program is
typically much smaller and faster to develop than
a counterpart written in C.
Consequently, there is often a payoff to prototype an
algorithm or design in AWK to get it running quickly and expose
problems early. Often, the interpreted performance is adequate
and the AWK prototype becomes the product.

The new @command{pgawk} (profiling @command{gawk}) produces
program execution counts.
I recently experimented with an algorithm that, for
@math{n} lines of input, exhibited
@tex
$\sim\! Cn^2$
@end tex
@ifnottex
~ C n^2
@end ifnottex
performance, while
theory predicted
@tex
$\sim\! Cn\log n$
@end tex
@ifnottex
~ C n log n
@end ifnottex
behavior. A few minutes poring
over the @file{awkprof.out} profile pinpointed the problem to
a single line of code. @command{pgawk} is a welcome addition to
my programmer's toolbox.

Arnold has distilled over a decade of experience writing and
using AWK programs, and developing @command{gawk}, into this book. If you use
AWK or want to learn how, then read this book.

@display
Michael Brennan
Author of @command{mawk}
@end display

@node Preface
@unnumbered Preface
@c I saw a comment somewhere that the preface should describe the book itself,
@c and the introduction should describe what the book covers.
@c
@c 12/2000: Chuck wants the preface & intro combined.

Several kinds of tasks occur repeatedly
when working with text files.
You might want to extract certain lines and discard the rest.
Or you may need to make changes wherever certain patterns appear,
but leave the rest of the file alone.
Writing single-use programs for these tasks in languages such as C, C++, or Pascal
is time-consuming and inconvenient.
Such jobs are often easier with @command{awk}.
The @command{awk} utility interprets a special-purpose programming language
that makes it easy to handle simple data-reformatting jobs.

The GNU implementation of @command{awk} is called @command{gawk}; it is fully
compatible with the System V Release 4 version of
@command{awk}. @command{gawk} is also compatible with the POSIX
specification of the @command{awk} language. This means that all
properly written @command{awk} programs should work with @command{gawk}.
Thus, we usually don't distinguish between @command{gawk} and other
@command{awk} implementations.

@cindex @command{awk}, POSIX and, See Also POSIX @command{awk}
@cindex @command{awk}, POSIX and
@cindex POSIX, @command{awk} and
@cindex @command{gawk}, @command{awk} and
@cindex @command{awk}, @command{gawk} and
@cindex @command{awk}, uses for
Using @command{awk} allows you to:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
Manage small, personal databases
|
|
|
|
@item
|
|
Generate reports
|
|
|
|
@item
|
|
Validate data
|
|
|
|
@item
|
|
Produce indexes and perform other document preparation tasks
|
|
|
|
@item
|
|
Experiment with algorithms that you can adapt later to other computer
|
|
languages
|
|
@end itemize
|
|
|
|
@cindex @command{awk}, See Also @command{gawk}
|
|
@cindex @command{gawk}, See Also @command{awk}
|
|
@cindex @command{gawk}, uses for
|
|
In addition,
|
|
@command{gawk}
|
|
provides facilities that make it easy to:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
Extract bits and pieces of data for processing
|
|
|
|
@item
|
|
Sort data
|
|
|
|
@item
|
|
Perform simple network communications
|
|
@end itemize
|
|
|
|
This @value{DOCUMENT} teaches you about the @command{awk} language and
|
|
how you can use it effectively. You should already be familiar with basic
|
|
system commands, such as @command{cat} and @command{ls},@footnote{These commands
|
|
are available on POSIX-compliant systems, as well as on traditional
|
|
Unix-based systems. If you are using some other operating system, you still need to
|
|
be familiar with the ideas of I/O redirection and pipes.} as well as basic shell
|
|
facilities, such as input/output (I/O) redirection and pipes.
|
|
|
|
@cindex GNU @command{awk}, See @command{gawk}
|
|
Implementations of the @command{awk} language are available for many
|
|
different computing environments. This @value{DOCUMENT}, while describing
|
|
the @command{awk} language in general, also describes the particular
|
|
implementation of @command{awk} called @command{gawk} (which stands for
|
|
``GNU awk''). @command{gawk} runs on a broad range of Unix systems,
|
|
from 80386 PC-based computers up through large-scale systems,
|
|
such as Crays. @command{gawk} has also been ported to Mac OS X,
|
|
MS-DOS, Microsoft Windows (all versions) and OS/2 PCs, Atari and Amiga
|
|
microcomputers, BeOS, Tandem D20, and VMS.
|
|
|
|
@menu
|
|
* History:: The history of @command{gawk} and
|
|
@command{awk}.
|
|
* Names:: What name to use to find @command{awk}.
|
|
* This Manual:: Using this @value{DOCUMENT}. Includes sample
|
|
input files that you can use.
|
|
* Conventions:: Typographical Conventions.
|
|
* Manual History:: Brief history of the GNU project and this
|
|
@value{DOCUMENT}.
|
|
* How To Contribute:: Helping to save the world.
|
|
* Acknowledgments:: Acknowledgments.
|
|
@end menu
|
|
|
|
@node History
|
|
@unnumberedsec History of @command{awk} and @command{gawk}
|
|
@cindex recipe for a programming language
|
|
@cindex programming language, recipe for
|
|
@center Recipe For A Programming Language
|
|
|
|
@multitable {2 parts} {1 part @code{egrep}} {1 part @code{snobol}}
|
|
@item @tab 1 part @code{egrep} @tab 1 part @code{snobol}
|
|
@item @tab 2 parts @code{ed} @tab 3 parts C
|
|
@end multitable
|
|
|
|
@quotation
|
|
Blend all parts well using @code{lex} and @code{yacc}.
|
|
Document minimally and release.
|
|
|
|
After eight years, add another part @code{egrep} and two
|
|
more parts C. Document very well and release.
|
|
@end quotation
|
|
|
|
@cindex Aho, Alfred
|
|
@cindex Weinberger, Peter
|
|
@cindex Kernighan, Brian
|
|
@cindex @command{awk}, history of
|
|
The name @command{awk} comes from the initials of its designers: Alfred V.@:
|
|
Aho, Peter J.@: Weinberger and Brian W.@: Kernighan. The original version of
|
|
@command{awk} was written in 1977 at AT&T Bell Laboratories.
|
|
In 1985, a new version made the programming
|
|
language more powerful, introducing user-defined functions, multiple input
|
|
streams, and computed regular expressions.
|
|
This new version became widely available with Unix System V
|
|
Release 3.1 (SVR3.1).
|
|
The version in SVR4 added some new features and cleaned
|
|
up the behavior in some of the ``dark corners'' of the language.
|
|
The specification for @command{awk} in the POSIX Command Language
|
|
and Utilities standard further clarified the language.
|
|
Both the @command{gawk} designers and the original Bell Laboratories @command{awk}
|
|
designers provided feedback for the POSIX specification.
|
|
|
|
@cindex Rubin, Paul
|
|
@cindex Fenlason, Jay
|
|
@cindex Trueman, David
|
|
Paul Rubin wrote the GNU implementation, @command{gawk}, in 1986.
|
|
Jay Fenlason completed it, with advice from Richard Stallman. John Woods
|
|
contributed parts of the code as well. In 1988 and 1989, David Trueman, with
|
|
help from me, thoroughly reworked @command{gawk} for compatibility
|
|
with the newer @command{awk}.
|
|
Circa 1995, I became the primary maintainer.
|
|
Current development focuses on bug fixes,
|
|
performance improvements, standards compliance, and occasionally, new features.
|
|
|
|
In May of 1997, J@"urgen Kahrs felt the need for network access
|
|
from @command{awk}, and with a little help from me, set about adding
|
|
features to do this for @command{gawk}. At that time, he also
|
|
wrote the bulk of
|
|
@cite{TCP/IP Internetworking with @command{gawk}}
|
|
(a separate document, available as part of the @command{gawk} distribution).
|
|
His code finally became part of the main @command{gawk} distribution
|
|
with @command{gawk} @value{PVERSION} 3.1.
|
|
|
|
@xref{Contributors},
|
|
for a complete list of those who made important contributions to @command{gawk}.
|
|
|
|
@node Names
|
|
@section A Rose by Any Other Name
|
|
|
|
@cindex @command{awk}, new vs. old
|
|
The @command{awk} language has evolved over the years. Full details are
|
|
provided in @ref{Language History}.
|
|
The language described in this @value{DOCUMENT}
|
|
is often referred to as ``new @command{awk}'' (@command{nawk}).
|
|
|
|
@cindex @command{awk}, versions of
|
|
Because of this, many systems have multiple
|
|
versions of @command{awk}.
|
|
Some systems have an @command{awk} utility that implements the
|
|
original version of the @command{awk} language and a @command{nawk} utility
|
|
for the new
|
|
version.
|
|
Others have an @command{oawk} version for the ``old @command{awk}''
|
|
language and plain @command{awk} for the new one. Still others only
|
|
have one version, which is usually the new one.@footnote{Often, these systems
|
|
use @command{gawk} for their @command{awk} implementation!}
|
|
|
|
@cindex @command{nawk} utility
|
|
@cindex @command{oawk} utility
|
|
All in all, this makes it difficult for you to know which version of
|
|
@command{awk} you should run when writing your programs. The best advice
|
|
I can give here is to check your local documentation. Look for @command{awk},
|
|
@command{oawk}, and @command{nawk}, as well as for @command{gawk}.
|
|
It is likely that you already
|
|
have some version of new @command{awk} on your system, which is what
|
|
you should use when running your programs. (Of course, if you're reading
|
|
this @value{DOCUMENT}, chances are good that you have @command{gawk}!)
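
As a rough sketch of one way to check (this assumes a POSIX-style shell;
the wording of the output differs from shell to shell, so none is shown
here), the shell's @command{type} command reports which of these names
it can find:

@example
$ type awk oawk nawk gawk
@end example

@noindent
Names that are not installed are reported as not found.
If @command{gawk} is present, @samp{gawk --version} prints its version number.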
|
|
|
|
Throughout this @value{DOCUMENT}, whenever we refer to a language feature
|
|
that should be available in any complete implementation of POSIX @command{awk},
|
|
we simply use the term @command{awk}. When referring to a feature that is
|
|
specific to the GNU implementation, we use the term @command{gawk}.
|
|
|
|
@node This Manual
|
|
@section Using This Book
|
|
@cindex @command{awk}, terms describing
|
|
|
|
The term @command{awk} refers to a particular program as well as to the language you
|
|
use to tell this program what to do. When we need to be careful, we call
|
|
the language ``the @command{awk} language,''
|
|
and the program ``the @command{awk} utility.''
|
|
This @value{DOCUMENT} explains
|
|
both the @command{awk} language and how to run the @command{awk} utility.
|
|
The term @dfn{@command{awk} program} refers to a program written by you in
|
|
the @command{awk} programming language.
|
|
|
|
@cindex @command{gawk}, @command{awk} and
|
|
@cindex @command{awk}, @command{gawk} and
|
|
@cindex POSIX @command{awk}
|
|
Primarily, this @value{DOCUMENT} explains the features of @command{awk},
|
|
as defined in the POSIX standard. It does so in the context of the
|
|
@command{gawk} implementation. While doing so, it also
|
|
attempts to describe important differences between @command{gawk}
|
|
and other @command{awk} implementations.@footnote{All such differences
|
|
appear in the index under the
|
|
entry ``differences in @command{awk} and @command{gawk}.''}
|
|
Finally, any @command{gawk} features that are not in
|
|
the POSIX standard for @command{awk} are noted.
|
|
|
|
@ifnotinfo
|
|
This @value{DOCUMENT} has the difficult task of being both a tutorial and a reference.
|
|
If you are a novice, feel free to skip over details that seem too complex.
|
|
You should also ignore the many cross-references; they are for the
|
|
expert user and for the online Info version of the document.
|
|
@end ifnotinfo
|
|
|
|
There are
|
|
subsections labelled
|
|
as @strong{Advanced Notes}
|
|
scattered throughout the @value{DOCUMENT}.
|
|
They add a more complete explanation of points that are relevant, but not likely
|
|
to be of interest on first reading.
|
|
All appear in the index, under the heading ``advanced features.''
|
|
|
|
Most of the time, the examples use complete @command{awk} programs.
|
|
In some of the more advanced sections, only the part of the @command{awk}
|
|
program that illustrates the concept currently being described is shown.
|
|
|
|
While this @value{DOCUMENT} is aimed principally at people who have not been
|
|
exposed
|
|
to @command{awk}, there is a lot of information here that even the @command{awk}
|
|
expert should find useful. In particular, the description of POSIX
|
|
@command{awk} and the example programs in
|
|
@ref{Library Functions}, and in
|
|
@ref{Sample Programs},
|
|
should be of interest.
|
|
|
|
@ref{Getting Started},
|
|
provides the essentials you need to know to begin using @command{awk}.
|
|
|
|
@ref{Regexp},
|
|
introduces regular expressions in general, and in particular the flavors
|
|
supported by POSIX @command{awk} and @command{gawk}.
|
|
|
|
@ref{Reading Files},
|
|
describes how @command{awk} reads your data.
|
|
It introduces the concepts of records and fields, as well
|
|
as the @code{getline} command.
|
|
I/O redirection is first described here.
|
|
|
|
@ref{Printing},
|
|
describes how @command{awk} programs can produce output with
|
|
@code{print} and @code{printf}.
|
|
|
|
@ref{Expressions},
|
|
describes expressions, which are the basic building blocks
|
|
for getting most things done in a program.
|
|
|
|
@ref{Patterns and Actions},
|
|
describes how to write patterns for matching records, actions for
|
|
doing something when a record is matched, and the built-in variables
|
|
@command{awk} and @command{gawk} use.
|
|
|
|
@ref{Arrays},
|
|
covers @command{awk}'s one-and-only data structure: associative arrays.
|
|
Deleting array elements and whole arrays is also described, as well as
|
|
sorting arrays in @command{gawk}.
|
|
|
|
@ref{Functions},
|
|
describes the built-in functions @command{awk} and
|
|
@command{gawk} provide, as well as how to define
|
|
your own functions.
|
|
|
|
@ref{Internationalization},
|
|
describes special features in @command{gawk} for translating program
|
|
messages into different languages at runtime.
|
|
|
|
@ref{Advanced Features},
|
|
describes a number of @command{gawk}-specific advanced features.
|
|
Of particular note
|
|
are the abilities to have two-way communications with another process,
|
|
perform TCP/IP networking, and
|
|
profile your @command{awk} programs.
|
|
|
|
@ref{Invoking Gawk},
|
|
describes how to run @command{gawk}, the meaning of its
|
|
command-line options, and how it finds @command{awk}
|
|
program source files.
|
|
|
|
@ref{Library Functions}, and
|
|
@ref{Sample Programs},
|
|
provide many sample @command{awk} programs.
|
|
Reading them allows you to see @command{awk}
|
|
solving real problems.
|
|
|
|
@ref{Language History},
|
|
describes how the @command{awk} language has evolved since
|
|
its first release. It also describes how @command{gawk}
|
|
has acquired features over time.
|
|
|
|
@ref{Installation},
|
|
describes how to get @command{gawk}, how to compile it
|
|
under Unix, and how to compile and use it on different
|
|
non-Unix systems. It also describes how to report bugs
|
|
in @command{gawk} and where to get three other freely
|
|
available implementations of @command{awk}.
|
|
|
|
@ref{Notes},
|
|
describes how to disable @command{gawk}'s extensions, as
|
|
well as how to contribute new code to @command{gawk},
|
|
how to write extension libraries, and some possible
|
|
future directions for @command{gawk} development.
|
|
|
|
@ref{Basic Concepts},
|
|
provides some very cursory background material for those who
|
|
are completely unfamiliar with computer programming.
|
|
It also gathers in one place a discussion of some of the issues
|
|
surrounding floating-point numbers.
|
|
|
|
The
|
|
@ref{Glossary},
|
|
defines most, if not all, of the significant terms used
|
|
throughout the book.
|
|
If you find terms that you aren't familiar with, try looking them up here.
|
|
|
|
@ref{Copying}, and
|
|
@ref{GNU Free Documentation License},
|
|
present the licenses that cover the @command{gawk} source code
|
|
and this @value{DOCUMENT}, respectively.
|
|
|
|
@node Conventions
|
|
@section Typographical Conventions
|
|
|
|
@cindex Texinfo
|
|
This @value{DOCUMENT} is written using Texinfo, the GNU documentation
|
|
formatting language.
|
|
A single Texinfo source file is used to produce both the printed and online
|
|
versions of the documentation.
|
|
@ifnotinfo
|
|
Because of this, the typographical conventions
|
|
are slightly different from those in other books you may have read.
|
|
@end ifnotinfo
|
|
@ifinfo
|
|
This @value{SECTION} briefly documents the typographical conventions used in Texinfo.
|
|
@end ifinfo
|
|
|
|
Examples you would type at the command line are preceded by the common
|
|
shell primary and secondary prompts, @samp{$} and @samp{>}.
|
|
Output from the command is preceded by the glyph ``@print{}''.
|
|
This typically represents the command's standard output.
|
|
Error messages, and other output on the command's standard error, are preceded
|
|
by the glyph ``@error{}''. For example:
|
|
|
|
@example
|
|
$ echo hi on stdout
|
|
@print{} hi on stdout
|
|
$ echo hello on stderr 1>&2
|
|
@error{} hello on stderr
|
|
@end example
|
|
|
|
@ifnotinfo
|
|
In the text, command names appear in @code{this font}, while code segments
|
|
appear in the same font and quoted, @samp{like this}. Some things are
|
|
emphasized @emph{like this}, and if a point needs to be made
|
|
strongly, it is done @strong{like this}. The first occurrence of
|
|
a new term is usually its @dfn{definition} and appears in the same
|
|
font as the previous occurrence of ``definition'' in this sentence.
|
|
@value{FN}s are indicated like this: @file{/path/to/ourfile}.
|
|
@end ifnotinfo
|
|
|
|
Characters that you type at the keyboard look @kbd{like this}. In particular,
|
|
there are special characters called ``control characters.'' These are
|
|
characters that you type by holding down both the @kbd{CONTROL} key and
|
|
another key, at the same time. For example, a @kbd{@value{CTL}-d} is typed
|
|
by first pressing and holding the @kbd{CONTROL} key, next
|
|
pressing the @kbd{d} key and finally releasing both keys.
|
|
|
|
@c fakenode --- for prepinfo
|
|
@subsubheading Dark Corners
|
|
@cindex Kernighan, Brian
|
|
@quotation
|
|
@i{Dark corners are basically fractal --- no matter how much
|
|
you illuminate, there's always a smaller but darker one.}@*
|
|
Brian Kernighan
|
|
@end quotation
|
|
|
|
@cindex d.c., See dark corner
|
|
@cindex dark corner
|
|
Until the POSIX standard (and @cite{The Gawk Manual}),
|
|
many features of @command{awk} were either poorly documented or not
|
|
documented at all. Descriptions of such features
|
|
(often called ``dark corners'') are noted in this @value{DOCUMENT} with
|
|
@iftex
|
|
the picture of a flashlight in the margin, as shown here.
|
|
@value{DARKCORNER}
|
|
@end iftex
|
|
@ifnottex
|
|
``(d.c.)''.
|
|
@end ifnottex
|
|
They also appear in the index under the heading ``dark corner.''
|
|
|
|
As noted by the opening quote, though, any
|
|
coverage of dark corners
|
|
is, by definition, incomplete.
|
|
|
|
@node Manual History
|
|
@unnumberedsec The GNU Project and This Book
|
|
|
|
@cindex FSF (Free Software Foundation)
|
|
@cindex Free Software Foundation (FSF)
|
|
@cindex Stallman, Richard
|
|
The Free Software Foundation (FSF) is a nonprofit organization dedicated
|
|
to the production and distribution of freely distributable software.
|
|
It was founded by Richard M.@: Stallman, the author of the original
|
|
Emacs editor. GNU Emacs is the most widely used version of Emacs today.
|
|
|
|
@cindex GNU Project
|
|
@cindex GPL (General Public License)
|
|
@cindex General Public License, See GPL
|
|
@cindex documentation, online
|
|
The GNU@footnote{GNU stands for ``GNU's not Unix.''}
|
|
Project is an ongoing effort on the part of the Free Software
|
|
Foundation to create a complete, freely distributable, POSIX-compliant
|
|
computing environment.
|
|
The FSF uses the ``GNU General Public License'' (GPL) to ensure that
|
|
their software's
|
|
source code is always available to the end user. A
|
|
copy of the GPL is included
|
|
@ifnotinfo
|
|
in this @value{DOCUMENT}
|
|
@end ifnotinfo
|
|
for your reference
|
|
(@pxref{Copying}).
|
|
The GPL applies to the C language source code for @command{gawk}.
|
|
To find out more about the FSF and the GNU Project online,
|
|
see @uref{http://www.gnu.org, the GNU Project's home page}.
|
|
This @value{DOCUMENT} may also be read from
|
|
@uref{http://www.gnu.org/manual/gawk/, their web site}.
|
|
|
|
A shell, an editor (Emacs), highly portable optimizing C, C++, and
|
|
Objective-C compilers, a symbolic debugger and dozens of large and
|
|
small utilities (such as @command{gawk}) have all been completed and are
|
|
freely available. The GNU operating
|
|
system kernel (the HURD) has been released but is still in an early
|
|
stage of development.
|
|
|
|
@cindex Linux
|
|
@cindex GNU/Linux
|
|
@cindex operating systems, BSD-based
|
|
@cindex Alpha (DEC)
|
|
Until the GNU operating system is more fully developed, you should
|
|
consider using GNU/Linux, a freely distributable, Unix-like operating
|
|
system for Intel 80386, DEC Alpha, Sun SPARC, IBM S/390, and other
|
|
systems.@footnote{The terminology ``GNU/Linux'' is explained
|
|
in the @ref{Glossary}.}
|
|
There are
|
|
many books on GNU/Linux. One that is freely available is @cite{Linux
|
|
Installation and Getting Started}, by Matt Welsh.
|
|
GNU/Linux distributions are often available in computer stores or
|
|
bundled on CD-ROMs with books about Linux.
|
|
(There are three other freely available, Unix-like operating systems for
|
|
80386 and other systems: NetBSD, FreeBSD, and OpenBSD. All are based on the
|
|
4.4-Lite Berkeley Software Distribution, and they use recent versions
|
|
of @command{gawk} for their versions of @command{awk}.)
|
|
|
|
@ifnotinfo
|
|
The @value{DOCUMENT} you are reading is actually free---at least, the
|
|
information in it is free to anyone. The machine-readable
|
|
source code for the @value{DOCUMENT} comes with @command{gawk}; anyone
|
|
may take this @value{DOCUMENT} to a copying machine and make as many
|
|
copies as they like. (Take a moment to check the Free Documentation
|
|
License in @ref{GNU Free Documentation License}.)
|
|
|
|
Although you could just print it out yourself, bound books are much
|
|
easier to read and use. Furthermore,
|
|
the proceeds from sales of this book go back to the FSF
|
|
to help fund development of more free software.
|
|
@end ifnotinfo
|
|
|
|
@ignore
|
|
@cindex Close, Diane
|
|
The @value{DOCUMENT} itself has gone through several previous,
|
|
preliminary editions.
|
|
Paul Rubin wrote the very first draft of @cite{The GAWK Manual};
|
|
it was around 40 pages in size.
|
|
Diane Close and Richard Stallman improved it, yielding the
|
|
version which I started working with in the fall of 1988.
|
|
It was around 90 pages long and barely described the original, ``old''
|
|
version of @command{awk}. After substantial revision, the first version of
|
|
the @cite{The GAWK Manual} to be released was Edition 0.11 Beta in
|
|
October of 1989. The manual then underwent more substantial revision
|
|
for Edition 0.13 of December 1991.
|
|
David Trueman, Pat Rankin and Michal Jaegermann contributed sections
|
|
of the manual for Edition 0.13.
|
|
That edition was published by the
|
|
FSF as a bound book early in 1992. Since then there were several
|
|
minor revisions, notably Edition 0.14 of November 1992 that was published
|
|
by the FSF in January of 1993 and Edition 0.16 of August 1993.
|
|
|
|
Edition 1.0 of @cite{GAWK: The GNU Awk User's Guide} represented a significant re-working
|
|
of @cite{The GAWK Manual}, with much additional material.
|
|
The FSF and I agreed that I was now the primary author.
|
|
@c I also felt that the manual needed a more descriptive title.
|
|
|
|
In January 1996, SSC published Edition 1.0 under the title @cite{Effective AWK Programming}.
|
|
In February 1997, they published Edition 1.0.3 which had minor changes
|
|
as a ``second edition.''
|
|
In 1999, the FSF published this same version as Edition 2
|
|
of @cite{GAWK: The GNU Awk User's Guide}.
|
|
|
|
Edition @value{EDITION} maintains the basic structure of Edition 1.0,
|
|
but with significant additional material, reflecting the host of new features
|
|
in @command{gawk} @value{PVERSION} @value{VERSION}.
|
|
Of particular note is
|
|
@ref{Array Sorting},
|
|
@ref{Bitwise Functions},
|
|
@ref{Internationalization},
|
|
@ref{Advanced Features},
|
|
and
|
|
@ref{Dynamic Extensions}.
|
|
@end ignore
|
|
|
|
@cindex Close, Diane
|
|
The @value{DOCUMENT} itself has gone through a number of previous editions.
|
|
Paul Rubin wrote the very first draft of @cite{The GAWK Manual};
|
|
it was around 40 pages in size.
|
|
Diane Close and Richard Stallman improved it, yielding a
|
|
version that was
|
|
around 90 pages long and barely described the original, ``old''
|
|
version of @command{awk}.
|
|
|
|
I started working with that version in the fall of 1988.
|
|
As work on it progressed,
|
|
the FSF published several preliminary versions (numbered 0.@var{x}).
|
|
In 1996, Edition 1.0 was released with @command{gawk} 3.0.0.
|
|
The FSF published the first two editions under
|
|
the title @cite{The GNU Awk User's Guide}.
|
|
|
|
This edition maintains the basic structure of Edition 1.0,
|
|
but with significant additional material, reflecting the host of new features
|
|
in @command{gawk} @value{PVERSION} @value{VERSION}.
|
|
Of particular note are
|
|
@ref{Array Sorting},
|
|
as well as
|
|
@ref{Bitwise Functions},
|
|
@ref{Internationalization},
|
|
and also
|
|
@ref{Advanced Features},
|
|
and
|
|
@ref{Dynamic Extensions}.
|
|
|
|
@cite{@value{TITLE}} will undoubtedly continue to evolve.
|
|
An electronic version
|
|
comes with the @command{gawk} distribution from the FSF.
|
|
If you find an error in this @value{DOCUMENT}, please report it!
|
|
@xref{Bugs}, for information on submitting
|
|
problem reports electronically, or write to me in care of the publisher.
|
|
|
|
@node How To Contribute
|
|
@unnumberedsec How to Contribute
|
|
|
|
As the maintainer of GNU @command{awk},
|
|
I am starting a collection of publicly available @command{awk}
|
|
programs.
|
|
For more information,
|
|
see @uref{ftp://ftp.freefriends.org/arnold/Awkstuff}.
|
|
If you have written an interesting @command{awk} program, or have written a
|
|
@command{gawk} extension that you would like to
|
|
share with the rest of the world, please contact me (@email{arnold@@gnu.org}).
|
|
Making things available on the Internet helps keep the
|
|
@command{gawk} distribution down to manageable size.
|
|
|
|
@node Acknowledgments
|
|
@unnumberedsec Acknowledgments
|
|
|
|
The initial draft of @cite{The GAWK Manual} had the following acknowledgments:
|
|
|
|
@quotation
|
|
Many people need to be thanked for their assistance in producing this
|
|
manual. Jay Fenlason contributed many ideas and sample programs. Richard
|
|
Mlynarik and Robert Chassell gave helpful comments on drafts of this
|
|
manual. The paper @cite{A Supplemental Document for @command{awk}} by John W.@:
|
|
Pierce of the Chemistry Department at UC San Diego, pinpointed several
|
|
issues relevant both to @command{awk} implementation and to this manual, that
|
|
would otherwise have escaped us.
|
|
@end quotation
|
|
|
|
@cindex Stallman, Richard
|
|
I would like to acknowledge Richard M.@: Stallman, for his vision of a
|
|
better world and for his courage in founding the FSF and starting the
|
|
GNU Project.
|
|
|
|
The following people (in alphabetical order)
|
|
provided helpful comments on various
|
|
versions of this book, up to and including this edition.
|
|
Rick Adams,
|
|
Nelson H.F. Beebe,
|
|
Karl Berry,
|
|
Dr.@: Michael Brennan,
|
|
Rich Burridge,
|
|
Claire Cloutier,
|
|
Diane Close,
|
|
Scott Deifik,
|
|
Christopher (``Topher'') Eliot,
|
|
Jeffrey Friedl,
|
|
Dr.@: Darrel Hankerson,
|
|
Michal Jaegermann,
|
|
Dr.@: Richard J.@: LeBlanc,
|
|
Michael Lijewski,
|
|
Pat Rankin,
|
|
Miriam Robbins,
|
|
Mary Sheehan,
|
|
and
|
|
Chuck Toporek.
|
|
|
|
@cindex Berry, Karl
|
|
@cindex Chassell, Robert J.@:
|
|
@c @cindex Texinfo
|
|
Robert J.@: Chassell provided much valuable advice on
|
|
the use of Texinfo.
|
|
He also deserves special thanks for
|
|
convincing me @emph{not} to title this @value{DOCUMENT}
|
|
@cite{How To Gawk Politely}.
|
|
Karl Berry helped significantly with the @TeX{} part of Texinfo.
|
|
|
|
@cindex Hartholz, Marshall
|
|
@cindex Hartholz, Elaine
|
|
@cindex Schreiber, Bert
|
|
@cindex Schreiber, Rita
|
|
I would like to thank Marshall and Elaine Hartholz of Seattle and
|
|
Dr.@: Bert and Rita Schreiber of Detroit for large amounts of quiet vacation
|
|
time in their homes, which allowed me to make significant progress on
|
|
this @value{DOCUMENT} and on @command{gawk} itself.
|
|
|
|
@cindex Hughes, Phil
|
|
Phil Hughes of SSC
|
|
contributed in a very important way by loaning me his laptop GNU/Linux
|
|
system, not once, but twice, which allowed me to do a lot of work while
|
|
away from home.
|
|
|
|
@cindex Trueman, David
|
|
David Trueman deserves special credit; he has done a yeoman job
|
|
of evolving @command{gawk} so that it performs well and without bugs.
|
|
Although he is no longer involved with @command{gawk},
|
|
working with him on this project was a significant pleasure.
|
|
|
|
@cindex Drepper, Ulrich
|
|
@cindex GNITS mailing list
|
|
@cindex mailing list, GNITS
|
|
The intrepid members of the GNITS mailing list, and most notably Ulrich
|
|
Drepper, provided invaluable help and feedback for the design of the
|
|
internationalization features.
|
|
|
|
@cindex Beebe, Nelson
|
|
@cindex Brown, Martin
|
|
@cindex Buening, Andreas
|
|
@cindex Deifik, Scott
|
|
@cindex Hankerson, Darrel
|
|
@cindex Hasegawa, Isamu
|
|
@cindex Jaegermann, Michal
|
|
@cindex Kahrs, J@"urgen
|
|
@cindex Rankin, Pat
|
|
@cindex Rommel, Kai Uwe
|
|
@cindex Zaretskii, Eli
|
|
Nelson Beebe,
|
|
Martin Brown,
|
|
Andreas Buening,
|
|
Scott Deifik,
|
|
Darrel Hankerson,
|
|
Isamu Hasegawa,
|
|
Michal Jaegermann,
|
|
J@"urgen Kahrs,
|
|
Pat Rankin,
|
|
Kai Uwe Rommel,
|
|
and Eli Zaretskii
|
|
(in alphabetical order)
|
|
make up the
|
|
@command{gawk} ``crack portability team.'' Without their hard work and
|
|
help, @command{gawk} would not be nearly the fine program it is today. It
|
|
has been and continues to be a pleasure working with this team of fine
|
|
people.
|
|
|
|
@cindex Kernighan, Brian
|
|
David and I would like to thank Brian Kernighan of Bell Laboratories for
|
|
invaluable assistance during the testing and debugging of @command{gawk}, and for
|
|
help in clarifying numerous points about the language. We could not have
|
|
done nearly as good a job on either @command{gawk} or its documentation without
|
|
his help.
|
|
|
|
Chuck Toporek, Mary Sheehan, and Claire Cloutier of O'Reilly & Associates contributed
|
|
significant editorial help for this @value{DOCUMENT} for the
|
|
3.1 release of @command{gawk}.
|
|
|
|
@cindex Robbins, Miriam
|
|
@cindex Robbins, Jean
|
|
@cindex Robbins, Harry
|
|
@cindex G-d
|
|
I must thank my wonderful wife, Miriam, for her patience through
|
|
the many versions of this project, for her proofreading,
|
|
and for sharing me with the computer.
|
|
I would like to thank my parents for their love, and for the grace with
|
|
which they raised and educated me.
|
|
Finally, I also must acknowledge my gratitude to G-d, for the many opportunities
|
|
He has sent my way, as well as for the gifts He has given me with which to
|
|
take advantage of those opportunities.
|
|
@sp 2
|
|
@noindent
|
|
Arnold Robbins @*
|
|
Nof Ayalon @*
|
|
ISRAEL @*
|
|
March, 2001
|
|
|
|
@ignore
|
|
@c Try this
|
|
@iftex
|
|
@page
|
|
@headings off
|
|
@majorheading I@ @ @ @ The @command{awk} Language and @command{gawk}
|
|
Part I describes the @command{awk} language and @command{gawk} program in detail.
|
|
It starts with the basics, and continues through all of the features of @command{awk}
|
|
and @command{gawk}. It contains the following chapters:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
@ref{Getting Started}.
|
|
|
|
@item
|
|
@ref{Regexp}.
|
|
|
|
@item
|
|
@ref{Reading Files}.
|
|
|
|
@item
|
|
@ref{Printing}.
|
|
|
|
@item
|
|
@ref{Expressions}.
|
|
|
|
@item
|
|
@ref{Patterns and Actions}.
|
|
|
|
@item
|
|
@ref{Arrays}.
|
|
|
|
@item
|
|
@ref{Functions}.
|
|
|
|
@item
|
|
@ref{Internationalization}.
|
|
|
|
@item
|
|
@ref{Advanced Features}.
|
|
|
|
@item
|
|
@ref{Invoking Gawk}.
|
|
@end itemize
|
|
|
|
@page
|
|
@evenheading @thispage@ @ @ @strong{@value{TITLE}} @| @|
|
|
@oddheading @| @| @strong{@thischapter}@ @ @ @thispage
|
|
@end iftex
|
|
@end ignore
|
|
|
|
@node Getting Started
|
|
@chapter Getting Started with @command{awk}
|
|
@c @cindex script, definition of
|
|
@c @cindex rule, definition of
|
|
@c @cindex program, definition of
|
|
@c @cindex basic function of @command{awk}
|
|
@cindex @command{awk}, function of
|
|
|
|
The basic function of @command{awk} is to search files for lines (or other
|
|
units of text) that contain certain patterns. When a line matches one
|
|
of the patterns, @command{awk} performs specified actions on that line.
|
|
@command{awk} keeps processing input lines in this way until it reaches
|
|
the end of the input files.
|
|
|
|
@cindex @command{awk}, uses for
|
|
@c comma here is NOT for secondary
|
|
@cindex programming languages, data-driven vs. procedural
|
|
@cindex @command{awk} programs
|
|
Programs in @command{awk} are different from programs in most other languages,
|
|
because @command{awk} programs are @dfn{data-driven}; that is, you describe
|
|
the data you want to work with and then what to do when you find it.
|
|
Most other languages are @dfn{procedural}; you have to describe, in great
|
|
detail, every step the program is to take. When working with procedural
|
|
languages, it is usually much
|
|
harder to clearly describe the data your program will process.
|
|
For this reason, @command{awk} programs are often refreshingly easy to
|
|
read and write.
|
|
|
|
@cindex program, definition of
|
|
@cindex rule, definition of
|
|
When you run @command{awk}, you specify an @command{awk} @dfn{program} that
|
|
tells @command{awk} what to do. The program consists of a series of
|
|
@dfn{rules}. (It may also contain @dfn{function definitions},
|
|
an advanced feature that we will ignore for now.
|
|
@xref{User-defined}.) Each rule specifies one
|
|
pattern to search for and one action to perform
|
|
upon finding the pattern.
|
|
|
|
Syntactically, a rule consists of a pattern followed by an action. The
|
|
action is enclosed in curly braces to separate it from the pattern.
|
|
Newlines usually separate rules. Therefore, an @command{awk}
|
|
program looks like this:
|
|
|
|
@example
|
|
@var{pattern} @{ @var{action} @}
|
|
@var{pattern} @{ @var{action} @}
|
|
@dots{}
|
|
@end example
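
As a concrete instance of this layout, here is a minimal sketch with two
rules. The three input lines and the patterns @samp{apple} and
@samp{cherry} are made up for illustration, and @code{$0} (the current
input line) is a feature that is explained later:

```shell
# Two rules, one per line: each pattern is followed by its action in braces.
# The input comes from printf rather than from a data file.
printf 'apple 3\nbanana 5\ncherry 7\n' |
awk '/apple/  { print "first rule matched:", $0 }
     /cherry/ { print "second rule matched:", $0 }'
```

The line containing @samp{banana} matches neither pattern, so no action
runs for it and nothing is printed for that line.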
|
|
|
|
@menu
|
|
* Running gawk:: How to run @command{gawk} programs; includes
|
|
command-line syntax.
|
|
* Sample Data Files:: Sample data files for use in the @command{awk}
|
|
programs illustrated in this @value{DOCUMENT}.
|
|
* Very Simple:: A very simple example.
|
|
* Two Rules:: A less simple one-line example using two
|
|
rules.
|
|
* More Complex:: A more complex example.
|
|
* Statements/Lines:: Subdividing or combining statements into
|
|
lines.
|
|
* Other Features:: Other Features of @command{awk}.
|
|
* When:: When to use @command{gawk} and when to use
|
|
other things.
|
|
@end menu
|
|
|
|
@node Running gawk
|
|
@section How to Run @command{awk} Programs
|
|
|
|
@cindex @command{awk} programs, running
|
|
There are several ways to run an @command{awk} program. If the program is
|
|
short, it is easiest to include it in the command that runs @command{awk},
|
|
like this:
|
|
|
|
@example
|
|
awk '@var{program}' @var{input-file1} @var{input-file2} @dots{}
|
|
@end example
|
|
|
|
@cindex command line, formats
|
|
When the program is long, it is usually more convenient to put it in a file
|
|
and run it with a command like this:
|
|
|
|
@example
|
|
awk -f @var{program-file} @var{input-file1} @var{input-file2} @dots{}
|
|
@end example
|
|
|
|
This @value{SECTION} discusses both mechanisms, along with several
|
|
variations of each.
|
|
|
|
@menu
|
|
* One-shot:: Running a short throwaway @command{awk}
|
|
program.
|
|
* Read Terminal:: Using no input files (input from terminal
|
|
instead).
|
|
* Long:: Putting permanent @command{awk} programs in
|
|
files.
|
|
* Executable Scripts:: Making self-contained @command{awk} programs.
|
|
* Comments:: Adding documentation to @command{gawk}
|
|
programs.
|
|
* Quoting:: More discussion of shell quoting issues.
|
|
@end menu
|
|
|
|
@node One-shot
|
|
@subsection One-Shot Throwaway @command{awk} Programs
|
|
|
|
Once you are familiar with @command{awk}, you will often type in simple
|
|
programs the moment you want to use them. Then you can write the
|
|
program as the first argument of the @command{awk} command, like this:
|
|
|
|
@example
|
|
awk '@var{program}' @var{input-file1} @var{input-file2} @dots{}
|
|
@end example
|
|
|
|
@noindent
|
|
where @var{program} consists of a series of @var{patterns} and
|
|
@var{actions}, as described earlier.
|
|
|
|
@cindex single quote (@code{'})
|
|
@cindex @code{'} (single quote)
|
|
This command format instructs the @dfn{shell}, or command interpreter,
|
|
to start @command{awk} and use the @var{program} to process records in the
|
|
input file(s). There are single quotes around @var{program} so
|
|
the shell won't interpret any @command{awk} characters as special shell
|
|
characters. The quotes also cause the shell to treat all of @var{program} as
|
|
a single argument for @command{awk}, and allow @var{program} to be more
|
|
than one line long.
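
For instance, because the quotes keep the newline inside a single
argument, a program can span two lines at the shell prompt. This is only
a sketch: the input is invented, and @code{END} and @code{NR} are
features that are described later.

```shell
# The single-quoted program spans two lines, yet it reaches awk as ONE
# argument, and the shell leaves $0 alone because of the single quotes.
printf 'one\ntwo\nthree\n' |
awk '{ print "read:", $0 }
     END { print NR, "lines in all" }'
```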
|
|
|
|
@cindex shells, scripts
|
|
@cindex @command{awk} programs, running, from shell scripts
|
|
This format is also useful for running short or medium-sized @command{awk}
|
|
programs from shell scripts, because it avoids the need for a separate
|
|
file for the @command{awk} program. A self-contained shell script is more
|
|
reliable because there are no other files to misplace.
|
|
|
|
@ref{Very Simple},
|
|
@ifnotinfo
|
|
later in this @value{CHAPTER},
|
|
@end ifnotinfo
|
|
presents several short,
|
|
self-contained programs.
|
|
|
|
@c Removed for gawk 3.1, doesn't really add anything here.
|
|
@ignore
|
|
As an interesting side point, the command
|
|
|
|
@example
|
|
awk '/foo/' @var{files} @dots{}
|
|
@end example
|
|
|
|
@noindent
|
|
is essentially the same as
|
|
|
|
@cindex @command{egrep} utility
|
|
@example
|
|
egrep foo @var{files} @dots{}
|
|
@end example
|
|
@end ignore
|
|
|
|
@node Read Terminal
|
|
@subsection Running @command{awk} Without Input Files
|
|
|
|
@cindex standard input
|
|
@cindex input, standard
|
|
@cindex input files, running @command{awk} without
|
|
You can also run @command{awk} without any input files. If you type the
|
|
following command line:
|
|
|
|
@example
|
|
awk '@var{program}'
|
|
@end example
|
|
|
|
@noindent
|
|
@command{awk} applies the @var{program} to the @dfn{standard input},
|
|
which usually means whatever you type on the terminal. This continues
|
|
until you indicate end-of-file by typing @kbd{@value{CTL}-d}.
|
|
(On other operating systems, the end-of-file character may be different.
|
|
For example, on OS/2 and MS-DOS, it is @kbd{@value{CTL}-z}.)
|
|
|
|
@cindex files, input, See input files
|
|
@cindex input files, running @command{awk} without
|
|
@cindex @command{awk} programs, running, without input files
|
|
As an example, the following program prints a friendly piece of advice
|
|
(from Douglas Adams's @cite{The Hitchhiker's Guide to the Galaxy}),
|
|
to keep you from worrying about the complexities of computer programming
|
|
(@code{BEGIN} is a feature we haven't discussed yet):
|
|
|
|
@example
|
|
$ awk "BEGIN @{ print \"Don't Panic!\" @}"
|
|
@print{} Don't Panic!
|
|
@end example
|
|
|
|
@cindex quoting
|
|
@cindex double quote (@code{"})
|
|
@cindex @code{"} (double quote)
|
|
@cindex @code{\} (backslash)
|
|
@cindex backslash (@code{\})
|
|
This program does not read any input. The @samp{\} before each of the
|
|
inner double quotes is necessary because of the shell's quoting
|
|
rules---in particular because it mixes both single quotes and
|
|
double quotes.@footnote{Although we generally recommend the use of single
|
|
quotes around the program text, double quotes are needed here in order to
|
|
put the single quote into the message.}
|
|
|
|
This next simple @command{awk} program
|
|
emulates the @command{cat} utility; it copies whatever you type on the
|
|
keyboard to its standard output (why this works is explained shortly).
|
|
|
|
@example
|
|
$ awk '@{ print @}'
Now is the time for all good men
@print{} Now is the time for all good men
to come to the aid of their country.
@print{} to come to the aid of their country.
Four score and seven years ago, ...
@print{} Four score and seven years ago, ...
What, me worry?
@print{} What, me worry?
@kbd{@value{CTL}-d}
|
|
@end example
|
|
|
|
@node Long
|
|
@subsection Running Long Programs
|
|
|
|
@cindex @command{awk} programs, running
|
|
@cindex @command{awk} programs, lengthy
|
|
@cindex files, @command{awk} programs in
|
|
Sometimes your @command{awk} programs can be very long. In this case, it is
|
|
more convenient to put the program into a separate file. In order to tell
|
|
@command{awk} to use that file for its program, you type:
|
|
|
|
@example
|
|
awk -f @var{source-file} @var{input-file1} @var{input-file2} @dots{}
|
|
@end example
|
|
|
|
@cindex @code{-f} option
|
|
@cindex command line, options
|
|
@cindex options, command-line
|
|
The @option{-f} instructs the @command{awk} utility to get the @command{awk} program
|
|
from the file @var{source-file}. Any @value{FN} can be used for
|
|
@var{source-file}. For example, you could put the program:
|
|
|
|
@example
|
|
BEGIN @{ print "Don't Panic!" @}
|
|
@end example
|
|
|
|
@noindent
|
|
into the file @file{advice}. Then this command:
|
|
|
|
@example
|
|
awk -f advice
|
|
@end example
|
|
|
|
@noindent
|
|
does the same thing as this one:
|
|
|
|
@example
|
|
awk "BEGIN @{ print \"Don't Panic!\" @}"
|
|
@end example
|
|
|
|
@cindex quoting
|
|
@noindent
|
|
This was explained earlier
|
|
(@pxref{Read Terminal}).
|
|
Note that you don't usually need single quotes around the @value{FN} that you
|
|
specify with @option{-f}, because most @value{FN}s don't contain any of the shell's
|
|
special characters. Notice that in @file{advice}, the @command{awk}
|
|
program did not have single quotes around it. The quotes are only needed
|
|
for programs that are provided on the @command{awk} command line.
|
|
|
|
@c STARTOFRANGE sq1x
|
|
@cindex single quote (@code{'})
|
|
@c STARTOFRANGE qs2x
|
|
@cindex @code{'} (single quote)
|
|
If you want to identify your @command{awk} program files clearly as such,
|
|
you can add the extension @file{.awk} to the @value{FN}. This doesn't
|
|
affect the execution of the @command{awk} program but it does make
|
|
``housekeeping'' easier.
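
The steps in this @value{SUBSECTION} can be tried end to end with a
sketch like the following. It assumes a POSIX-style shell and the
@command{mktemp} utility; the scratch directory and the here-document
are conveniences of this sketch, not requirements of @command{awk}.

```shell
# Work in a scratch directory so nothing in the current directory changes.
dir=$(mktemp -d)

# Store the program in a file. The quoted 'EOF' delimiter stops the shell
# from interpreting anything inside the here-document.
cat > "$dir/advice.awk" <<'EOF'
BEGIN { print "Don't Panic!" }
EOF

# Run the stored program; the program text itself needs no shell quoting.
awk -f "$dir/advice.awk"
```

The apostrophe in the message causes no trouble here, because the shell
never sees the program text as part of a command line.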
|
|
|
|
@node Executable Scripts
|
|
@subsection Executable @command{awk} Programs
|
|
@cindex @command{awk} programs
|
|
@cindex @code{#} (number sign), @code{#!} (executable scripts)
|
|
@cindex number sign (@code{#}), @code{#!} (executable scripts)
|
|
@cindex Unix, @command{awk} scripts and
|
|
@cindex @code{#} (number sign), @code{#!} (executable scripts), portability issues with
|
|
@cindex number sign (@code{#}), @code{#!} (executable scripts), portability issues with
|
|
|
|
Once you have learned @command{awk}, you may want to write self-contained
|
|
@command{awk} scripts, using the @samp{#!} script mechanism. You can do
|
|
this on many Unix systems@footnote{The @samp{#!} mechanism works on
|
|
Linux systems,
|
|
systems derived from the 4.4-Lite Berkeley Software Distribution,
|
|
and most commercial Unix systems.} as well as on the GNU system.
|
|
For example, you could update the file @file{advice} to look like this:
|
|
|
|
@example
|
|
#! /bin/awk -f
|
|
|
|
BEGIN @{ print "Don't Panic!" @}
|
|
@end example
|
|
|
|
@noindent
|
|
After making this file executable (with the @command{chmod} utility),
|
|
simply type @samp{advice}
|
|
at the shell and the system arranges to run @command{awk}@footnote{The
|
|
line beginning with @samp{#!} lists the full @value{FN} of an interpreter
|
|
to run and an optional initial command-line argument to pass to that
|
|
interpreter. The operating system then runs the interpreter with the given
|
|
argument and the full argument list of the executed program. The first argument
|
|
in the list is the full @value{FN} of the @command{awk} program. The rest of the
|
|
argument list contains either options to @command{awk}, or @value{DF}s,
|
|
or both.} as if you had
|
|
typed @samp{awk -f advice}:
|
|
|
|
@example
|
|
$ chmod +x advice
|
|
$ advice
|
|
@print{} Don't Panic!
|
|
@end example
|
|
|
|
@noindent
|
|
(We assume you have the current directory in your shell's search
|
|
path variable (typically @code{$PATH}). If not, you may need
|
|
to type @samp{./advice} at the shell.)
|
|
|
|
Self-contained @command{awk} scripts are useful when you want to write a
|
|
program that users can invoke without their having to know that the program is
|
|
written in @command{awk}.
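
The following sketch builds and runs such a script in a scratch
directory. It makes some assumptions that are not part of the text
above: a POSIX-style shell, the @command{mktemp} utility, and the use of
@samp{command -v awk} to find the interpreter, because @command{awk} is
not installed as @file{/bin/awk} on every system.

```shell
# Scratch directory for the demonstration script.
dir=$(mktemp -d)

# Look up where awk lives on this system rather than assuming /bin/awk.
awk_path=$(command -v awk)

# Write the script: a '#!' line naming the interpreter plus the -f option,
# followed by the awk program. The unquoted EOF lets $awk_path expand.
cat > "$dir/advice" <<EOF
#! $awk_path -f
BEGIN { print "Don't Panic!" }
EOF

chmod +x "$dir/advice"   # make the file executable
"$dir/advice"            # run it by name; the system starts awk for us
```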
|
|
|
|
@c fakenode --- for prepinfo
|
|
@subheading Advanced Notes: Portability Issues with @samp{#!}
|
|
@cindex portability, @code{#!} (executable scripts)
|
|
|
|
Some systems limit the length of the interpreter name to 32 characters.
|
|
Often, this can be dealt with by using a symbolic link.
|
|
|
|
You should not put more than one argument on the @samp{#!}
|
|
line after the path to @command{awk}. It does not work. The operating system
|
|
treats the rest of the line as a single argument and passes it to @command{awk}.
|
|
Doing this leads to confusing behavior---most likely a usage diagnostic
|
|
of some sort from @command{awk}.
|
|
|
|
@cindex @code{ARGC}/@code{ARGV} variables, portability and
|
|
@cindex portability, @code{ARGV} variable
|
|
Finally,
|
|
the value of @code{ARGV[0]}
|
|
(@pxref{Built-in Variables})
|
|
varies depending upon your operating system.
|
|
Some systems put @samp{awk} there, some put the full pathname
|
|
of @command{awk} (such as @file{/bin/awk}), and some put the name
|
|
of your script (@samp{advice}). Don't rely on the value of @code{ARGV[0]}
|
|
to provide your script name.
|
|
|
|
@node Comments
|
|
@subsection Comments in @command{awk} Programs
|
|
@cindex @code{#} (number sign), commenting
|
|
@cindex number sign (@code{#}), commenting
|
|
@cindex commenting
|
|
@cindex @command{awk} programs, documenting
|
|
|
|
A @dfn{comment} is some text that is included in a program for the sake
|
|
of human readers; it is not really an executable part of the program. Comments
|
|
can explain what the program does and how it works. Nearly all
|
|
programming languages have provisions for comments, as programs are
|
|
typically hard to understand without them.
|
|
|
|
In the @command{awk} language, a comment starts with the sharp sign
|
|
character (@samp{#}) and continues to the end of the line.
|
|
The @samp{#} does not have to be the first character on the line. The
|
|
@command{awk} language ignores the rest of a line following a sharp sign.
|
|
For example, we could have put the following into @file{advice}:
|
|
|
|
@example
|
|
# This program prints a nice friendly message. It helps
|
|
# keep novice users from being afraid of the computer.
|
|
BEGIN @{ print "Don't Panic!" @}
|
|
@end example
|
|
|
|
You can put comment lines into keyboard-composed throwaway @command{awk}
|
|
programs, but this usually isn't very useful; the purpose of a
|
|
comment is to help you or another person understand the program
|
|
when reading it at a later time.
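
As a small sketch of both comment placements, the hypothetical program
file @file{count.awk} below has one comment on a line of its own and two
that follow code on the same line. The @value{FN}, the scratch directory,
and the sample input are assumptions made for this illustration:

```shell
dir=$(mktemp -d)

# Keeping the program in a file sidesteps shell quoting entirely.
cat > "$dir/count.awk" <<'EOF'
# Count the input lines that are not empty.
length($0) > 0 { n++ }      # a comment may follow code on the same line
END { print n + 0 }         # "+ 0" prints 0 rather than an empty line
EOF

# Three input lines, the middle one empty.
printf 'a\n\nb\n' | awk -f "$dir/count.awk"
```

Here @command{awk} reports two nonempty lines; the comments have no
effect on the result.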
|
|
|
|
@cindex quoting
|
|
@cindex single quote (@code{'}), vs. apostrophe
|
|
@cindex @code{'} (single quote), vs. apostrophe
|
|
@strong{Caution:} As mentioned in
|
|
@ref{One-shot},
|
|
you can enclose small to medium programs in single quotes, in order to keep
|
|
your shell scripts self-contained. When doing so, @emph{don't} put
|
|
an apostrophe (i.e., a single quote) into a comment (or anywhere else
|
|
in your program). The shell interprets the quote as the closing
|
|
quote for the entire program. As a result, usually the shell
|
|
prints a message about mismatched quotes, and if @command{awk} actually
|
|
runs, it will probably print strange messages about syntax errors.
|
|
For example, look at the following:
|
|
|
|
@example
|
|
$ awk '@{ print "hello" @} # let's be cute'
|
|
>
|
|
@end example
|
|
|
|
The shell sees that the first two quotes match, and that
|
|
a new quoted object begins at the end of the command line.
|
|
It therefore prompts with the secondary prompt, waiting for more input.
|
|
With Unix @command{awk}, closing the quoted string produces this result:
|
|
|
|
@example
|
|
$ awk '@{ print "hello" @} # let's be cute'
|
|
> '
|
|
@error{} awk: can't open file be
|
|
@error{} source line number 1
|
|
@end example
|
|
|
|
@cindex @code{\} (backslash)
|
|
@cindex backslash (@code{\})
|
|
Putting a backslash before the single quote in @samp{let's} wouldn't help,
|
|
since backslashes are not special inside single quotes.
|
|
The next @value{SUBSECTION} describes the shell's quoting rules.
|
|
|
|
@node Quoting
|
|
@subsection Shell-Quoting Issues
|
|
@cindex quoting, rules for
|
|
|
|
For short to medium length @command{awk} programs, it is most convenient
|
|
to enter the program on the @command{awk} command line.
|
|
This is best done by enclosing the entire program in single quotes.
|
|
This is true whether you are entering the program interactively at
|
|
the shell prompt, or writing it as part of a larger shell script:
|
|
|
|
@example
|
|
awk '@var{program text}' @var{input-file1} @var{input-file2} @dots{}
|
|
@end example
|
|
|
|
@cindex shells, quoting, rules for
|
|
@cindex Bourne shell, quoting rules for
|
|
Once you are working with the shell, it is helpful to have a basic
|
|
knowledge of shell quoting rules. The following rules apply only to
|
|
POSIX-compliant, Bourne-style shells (such as @command{bash}, the GNU Bourne-Again
|
|
Shell). If you use @command{csh}, you're on your own.
|
|
|
|
@itemize @bullet
|
|
@item
|
|
Quoted items can be concatenated with nonquoted items as well as with other
|
|
quoted items. The shell turns everything into one argument for
|
|
the command.
|
|
|
|
@item
|
|
Preceding any single character with a backslash (@samp{\}) quotes
|
|
that character. The shell removes the backslash and passes the quoted
|
|
character on to the command.
|
|
|
|
@item
|
|
@cindex @code{\} (backslash)
|
|
@cindex backslash (@code{\})
|
|
@cindex single quote (@code{'})
|
|
@cindex @code{'} (single quote)
|
|
Single quotes protect everything between the opening and closing quotes.
|
|
The shell does no interpretation of the quoted text, passing it on verbatim
|
|
to the command.
|
|
It is @emph{impossible} to embed a single quote inside single-quoted text.
|
|
Refer back to
|
|
@ref{Comments},
|
|
for an example of what happens if you try.
|
|
|
|
@item
|
|
@cindex double quote (@code{"})
|
|
@cindex @code{"} (double quote)
|
|
Double quotes protect most things between the opening and closing quotes.
|
|
The shell does at least variable and command substitution on the quoted text.
|
|
Different shells may do additional kinds of processing on double-quoted text.
|
|
|
|
Since certain characters within double-quoted text are processed by the shell,
|
|
they must be @dfn{escaped} within the text. Of note are the characters
|
|
@samp{$}, @samp{`}, @samp{\}, and @samp{"}, all of which must be preceded by
|
|
a backslash within double-quoted text if they are to be passed on literally
|
|
to the program. (The leading backslash is stripped first.)
|
|
Thus, the example seen
|
|
@ifnotinfo
|
|
previously
|
|
@end ifnotinfo
|
|
in @ref{Read Terminal},
|
|
is applicable:
|
|
|
|
@example
|
|
$ awk "BEGIN @{ print \"Don't Panic!\" @}"
|
|
@print{} Don't Panic!
|
|
@end example
|
|
|
|
@cindex single quote (@code{'}), with double quotes
|
|
@cindex @code{'} (single quote), with double quotes
|
|
Note that the single quote is not special within double quotes.
|
|
|
|
@item
|
|
Null strings are removed when they occur as part of a non-null
command-line argument, while explicit null objects are kept.
|
|
For example, to specify that the field separator @code{FS} should
|
|
be set to the null string, use:
|
|
|
|
@example
|
|
awk -F "" '@var{program}' @var{files} # correct
|
|
@end example
|
|
|
|
@noindent
|
|
@cindex null strings, quoting and
|
|
Don't use this:
|
|
|
|
@example
|
|
awk -F"" '@var{program}' @var{files} # wrong!
|
|
@end example
|
|
|
|
@noindent
|
|
In the second case, @command{awk} will attempt to use the text of the program
|
|
as the value of @code{FS}, and the first @value{FN} as the text of the program!
|
|
This results in syntax errors at best, and confusing behavior at worst.
|
|
@end itemize
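
The rules in the preceding list can be observed directly with a sketch
such as the following. It assumes a POSIX-style shell; the variable
@code{name} and the use of @command{printf} to show argument boundaries
are choices made for this illustration. Each argument that
@command{printf} receives is printed on its own line inside square
brackets.

```shell
name=world                      # hypothetical variable for the demonstration

# Quoted and unquoted pieces are joined into one argument.
printf '[%s]\n' 'one'"two"three     # prints [onetwothree]

# A backslash quotes the single character that follows it.
printf '[%s]\n' a\ b                # prints [a b]

# Single quotes pass the text on verbatim.
printf '[%s]\n' '$name'             # prints [$name]

# Double quotes allow variable substitution unless the $ is escaped.
printf '[%s]\n' "$name \$name"      # prints [world $name]

# A null string glued to other text disappears ...
printf '[%s]\n' -F""                # prints [-F]

# ... but a null string standing alone is kept as an empty argument.
printf '[%s]\n' -F ""               # prints [-F] and then []
```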
|
|
|
|
@cindex quoting, tricks for
|
|
Mixing single and double quotes is difficult. You have to resort
|
|
to shell quoting tricks, like this:
|
|
|
|
@example
|
|
$ awk 'BEGIN @{ print "Here is a single quote <'"'"'>" @}'
|
|
@print{} Here is a single quote <'>
|
|
@end example
|
|
|
|
@noindent
|
|
This program consists of three concatenated quoted strings. The first and the
|
|
third are single-quoted, the second is double-quoted.
|
|
|
|
This can be ``simplified'' to:
|
|
|
|
@example
|
|
$ awk 'BEGIN @{ print "Here is a single quote <'\''>" @}'
|
|
@print{} Here is a single quote <'>
|
|
@end example
|
|
|
|
@noindent
|
|
Judge for yourself which of these two is the more readable.
|
|
|
|
Another option is to use double quotes, escaping the embedded, @command{awk}-level
|
|
double quotes:
|
|
|
|
@example
|
|
$ awk "BEGIN @{ print \"Here is a single quote <'>\" @}"
|
|
@print{} Here is a single quote <'>
|
|
@end example
|
|
|
|
@noindent
|
|
@c ENDOFRANGE sq1x
|
|
@c ENDOFRANGE qs2x
|
|
This option is also painful, because double quotes, backslashes, and dollar signs
|
|
are very common in @command{awk} programs.
|
|
|
|
If you really need both single and double quotes in your @command{awk}
|
|
program, it is probably best to move it into a separate file, where
|
|
the shell won't be part of the picture, and you can say what you mean.
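
When a separate file is inconvenient, two other approaches keep the
program in single quotes. They are sketched here with some assumptions:
@code{q} is simply a variable name chosen for this example, the
@option{-v} option assigns it before the program starts, and the octal
escape @samp{\047} assumes an ASCII-compatible character set.

```shell
# Pass the single quote in as an awk variable named q.
awk -v q="'" 'BEGIN { print "Here is a single quote <" q ">" }'

# Or write the quote as the octal escape \047 inside the awk string.
awk 'BEGIN { print "Here is a single quote <\047>" }'
```

Both commands print the same line as the earlier examples, and neither
needs a quoting trick inside the program text itself.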
|
|
|
|
@node Sample Data Files
|
|
@section @value{DDF}s for the Examples
|
|
@c For gawk >= 3.2, update these data files. No-one has such slow modems!
|
|
|
|
@cindex input files, examples
|
|
@cindex @code{BBS-list} file
|
|
Many of the examples in this @value{DOCUMENT} take their input from two sample
|
|
@value{DF}s. The first, @file{BBS-list}, represents a list of
|
|
computer bulletin board systems together with information about those systems.
|
|
The second @value{DF}, called @file{inventory-shipped}, contains
|
|
information about monthly shipments. In both files,
|
|
each line is considered to be one @dfn{record}.
|
|
|
|
In the @value{DF} @file{BBS-list}, each record contains the name of a computer
|
|
bulletin board, its phone number, the board's baud rate(s), and a code for
|
|
the number of hours it is operational. An @samp{A} in the last column
|
|
means the board operates 24 hours a day. A @samp{B} in the last
|
|
column means the board only operates on evening and weekend hours.
|
|
A @samp{C} means the board operates only on weekends:
|
|
|
|
@c 2e: Update the baud rates to reflect today's faster modems
|
|
@example
|
|
@c system if test ! -d eg ; then mkdir eg ; fi
|
|
@c system if test ! -d eg/lib ; then mkdir eg/lib ; fi
|
|
@c system if test ! -d eg/data ; then mkdir eg/data ; fi
|
|
@c system if test ! -d eg/prog ; then mkdir eg/prog ; fi
|
|
@c system if test ! -d eg/misc ; then mkdir eg/misc ; fi
|
|
@c file eg/data/BBS-list
|
|
aardvark     555-5553     1200/300          B
alpo-net     555-3412     2400/1200/300     A
barfly       555-7685     1200/300          A
bites        555-1675     2400/1200/300     A
camelot      555-0542     300               C
core         555-2912     1200/300          C
fooey        555-1234     2400/1200/300     B
foot         555-6699     1200/300          B
macfoo       555-6480     1200/300          A
sdace        555-3430     2400/1200/300     A
sabafoo      555-2127     1200/300          C
|
|
@c endfile
|
|
@end example
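
To get a feel for the layout of a record, the following sketch recreates
@file{BBS-list} in a scratch directory and lists the boards that operate
24 hours a day. The scratch directory and @command{mktemp} are
assumptions of the sketch, and field references such as @code{$1} and
@code{$4} are explained later in this @value{DOCUMENT}.

```shell
# Recreate the sample file in a scratch directory.
dir=$(mktemp -d)
cat > "$dir/BBS-list" <<'EOF'
aardvark     555-5553     1200/300          B
alpo-net     555-3412     2400/1200/300     A
barfly       555-7685     1200/300          A
bites        555-1675     2400/1200/300     A
camelot      555-0542     300               C
core         555-2912     1200/300          C
fooey        555-1234     2400/1200/300     B
foot         555-6699     1200/300          B
macfoo       555-6480     1200/300          A
sdace        555-3430     2400/1200/300     A
sabafoo      555-2127     1200/300          C
EOF

# Print the name (field 1) of every board whose hours code (field 4) is A.
awk '$4 == "A" { print $1 }' "$dir/BBS-list"
```

This prints @samp{alpo-net}, @samp{barfly}, @samp{bites}, @samp{macfoo},
and @samp{sdace}, one per line.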
|
|
|
|
@cindex @code{inventory-shipped} file
|
|
The @value{DF} @file{inventory-shipped} represents
|
|
information about shipments during the year.
|
|
Each record contains the month, the number
|
|
of green crates shipped, the number of red boxes shipped, the number of
|
|
orange bags shipped, and the number of blue packages shipped,
|
|
respectively. There are 16 entries, covering the 12 months of last year
|
|
and the first four months of the current year.
|
|
|
|
@example
|
|
@c file eg/data/inventory-shipped
|
|
Jan  13  25  15 115
Feb  15  32  24 226
Mar  15  24  34 228
Apr  31  52  63 420
May  16  34  29 208
Jun  31  42  75 492
Jul  24  34  67 436
Aug  15  34  47 316
Sep  13  55  37 277
Oct  29  54  68 525
Nov  20  87  82 577
Dec  17  35  61 401

Jan  21  36  64 620
Feb  26  58  80 652
Mar  24  75  70 495
Apr  21  70  74 514
|
|
@c endfile
|
|
@end example
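
As a quick check of this description, the following sketch recreates
@file{inventory-shipped} in a scratch directory, counts the entries, and
totals the green crates in the second column. The scratch directory, the
variable names @code{n} and @code{green}, and the @samp{NF > 0} test
that skips the blank separator line are choices made for this sketch;
the features it uses are explained later.

```shell
dir=$(mktemp -d)
cat > "$dir/inventory-shipped" <<'EOF'
Jan  13  25  15 115
Feb  15  32  24 226
Mar  15  24  34 228
Apr  31  52  63 420
May  16  34  29 208
Jun  31  42  75 492
Jul  24  34  67 436
Aug  15  34  47 316
Sep  13  55  37 277
Oct  29  54  68 525
Nov  20  87  82 577
Dec  17  35  61 401

Jan  21  36  64 620
Feb  26  58  80 652
Mar  24  75  70 495
Apr  21  70  74 514
EOF

# Skip the blank line (NF is 0 there), count entries, and add up column 2.
awk 'NF > 0 { n++; green += $2 }
     END    { print n, "records,", green, "green crates" }' "$dir/inventory-shipped"
```

The output is @samp{16 records, 331 green crates}, which agrees with the
16 entries described above.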
|
|
|
|
@ifinfo
|
|
If you are reading this in GNU Emacs using Info, you can copy the regions
|
|
of text showing these sample files into your own test files. This way you
|
|
can try out the examples shown in the remainder of this document. You do
|
|
this by using the command @kbd{M-x write-region} to copy text from the Info
|
|
file into a file for use with @command{awk}
|
|
(@pxref{Misc File Ops, , Miscellaneous File Operations, emacs, GNU Emacs Manual},
|
|
for more information). Using this information, create your own
|
|
@file{BBS-list} and @file{inventory-shipped} files and practice what you
|
|
learn in this @value{DOCUMENT}.
|
|
|
|
@cindex Texinfo
|
|
If you are using the stand-alone version of Info,
|
|
see @ref{Extract Program},
|
|
for an @command{awk} program that extracts these @value{DF}s from
|
|
@file{gawk.texi}, the Texinfo source file for this Info file.
|
|
@end ifinfo
|
|
|
|
@node Very Simple
|
|
@section Some Simple Examples
|
|
|
|
The following command runs a simple @command{awk} program that searches the
|
|
input file @file{BBS-list} for the character string @samp{foo} (a
|
|
grouping of characters is usually called a @dfn{string};
|
|
the term @dfn{string} is based on similar usage in English, such
|
|
as ``a string of pearls,'' or ``a string of cars in a train''):
|
|
|
|
@example
|
|
awk '/foo/ @{ print $0 @}' BBS-list
|
|
@end example
|
|
|
|
@noindent
|
|
When lines containing @samp{foo} are found, they are printed because
|
|
@w{@samp{print $0}} means print the current line. (Just @samp{print} by
|
|
itself means the same thing, so we could have written that
|
|
instead.)
|
|
|
|
You will notice that slashes (@samp{/}) surround the string @samp{foo}
|
|
in the @command{awk} program. The slashes indicate that @samp{foo}
|
|
is the pattern to search for. This type of pattern is called a
|
|
@dfn{regular expression}, which is covered in more detail later
|
|
(@pxref{Regexp}).
|
|
The pattern is allowed to match parts of words.
|
|
There are
|
|
single quotes around the @command{awk} program so that the shell won't
|
|
interpret any of it as special shell characters.
|
|
|
|
Here is what this program prints:
|
|
|
|
@example
|
|
$ awk '/foo/ @{ print $0 @}' BBS-list
@print{} fooey        555-1234     2400/1200/300     B
@print{} foot         555-6699     1200/300          B
@print{} macfoo       555-6480     1200/300          A
@print{} sabafoo      555-2127     1200/300          C
|
|
@end example
|
|
|
|
@cindex actions, default
|
|
@cindex patterns, default
|
|
In an @command{awk} rule, either the pattern or the action can be omitted,
|
|
but not both. If the pattern is omitted, then the action is performed
|
|
for @emph{every} input line. If the action is omitted, the default
|
|
action is to print all lines that match the pattern.
|
|
|
|
@cindex actions, empty
|
|
Thus, we could leave out the action (the @code{print} statement and the curly
|
|
braces) in the previous example and the result would be the same: all
|
|
lines matching the pattern @samp{foo} are printed. By comparison,
|
|
omitting the @code{print} statement but retaining the curly braces makes an
|
|
empty action that does nothing (i.e., no lines are printed).
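
The three cases can be compared side by side with a sketch like this
one. The shell variable @code{input} and its three lines are invented
for the illustration:

```shell
input='foo one
bar two
foo three'

# Action omitted: lines matching the pattern are printed.
printf '%s\n' "$input" | awk '/foo/'

# Pattern omitted: the action runs for every line.
printf '%s\n' "$input" | awk '{ print }'

# Empty action: the pattern still matches, but nothing is printed.
printf '%s\n' "$input" | awk '/foo/ { }'
```

The first command prints the two @samp{foo} lines, the second prints all
three lines, and the third prints nothing at all.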
|
|
|
|
@cindex @command{awk} programs, one-line examples
|
|
Many practical @command{awk} programs are just a line or two. Following is a
|
|
collection of useful, short programs to get you started. Some of these
|
|
programs contain constructs that haven't been covered yet. (The description
|
|
of the program will give you a good idea of what is going on, but please
|
|
read the rest of the @value{DOCUMENT} to become an @command{awk} expert!)
|
|
Most of the examples use a @value{DF} named @file{data}. This is just a
|
|
placeholder; if you use these programs yourself, substitute
|
|
your own @value{FN}s for @file{data}.
|
|
For future reference, note that there is often more than
|
|
one way to do things in @command{awk}. At some point, you may want
|
|
to look back at these examples and see if
|
|
you can come up with different ways to do the same things shown here:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
Print the length of the longest input line:
|
|
|
|
@example
|
|
awk '@{ if (length($0) > max) max = length($0) @}
     END @{ print max @}' data
|
|
@end example
|
|
|
|
@item
|
|
Print every line that is longer than 80 characters:
|
|
|
|
@example
|
|
awk 'length($0) > 80' data
|
|
@end example
|
|
|
|
The sole rule has a relational expression as its pattern and it has no
|
|
action---so the default action, printing the record, is used.
|
|
|
|
@cindex @command{expand} utility
|
|
@item
|
|
Print the length of the longest line in @file{data}:
|
|
|
|
@example
|
|
expand data | awk '@{ if (x < length()) x = length() @}
                   END @{ print "maximum line length is " x @}'
|
|
@end example
|
|
|
|
The input is processed by the @command{expand} utility to change tabs
|
|
into spaces, so the widths compared are actually the right-margin columns.
|
|
|
|
@item
|
|
Print every line that has at least one field:
|
|
|
|
@example
|
|
awk 'NF > 0' data
|
|
@end example
|
|
|
|
This is an easy way to delete blank lines from a file (or rather, to
|
|
create a new file similar to the old file but from which the blank lines
|
|
have been removed).
|
|
|
|
@item
|
|
Print seven random numbers from 0 to 100, inclusive:
|
|
|
|
@example
|
|
awk 'BEGIN @{ for (i = 1; i <= 7; i++)
                 print int(101 * rand()) @}'
|
|
@end example
|
|
|
|
@item
|
|
Print the total number of bytes used by @var{files}:
|
|
|
|
@example
|
|
ls -l @var{files} | awk '@{ x += $5 @}
                   END @{ print "total bytes: " x @}'
|
|
@end example
|
|
|
|
@item
|
|
Print the total number of kilobytes used by @var{files}:
|
|
|
|
@c Don't use \ continuation, not discussed yet
|
|
@example
|
|
ls -l @var{files} | awk '@{ x += $5 @}
   END @{ print "total K-bytes: " (x + 1023)/1024 @}'
|
|
@end example
|
|
|
|
@item
|
|
Print a sorted list of the login names of all users:
|
|
|
|
@example
|
|
awk -F: '@{ print $1 @}' /etc/passwd | sort
|
|
@end example
|
|
|
|
@item
|
|
Count the lines in a file:
|
|
|
|
@example
|
|
awk 'END @{ print NR @}' data
|
|
@end example
|
|
|
|
@item
|
|
Print the even-numbered lines in the @value{DF}:
|
|
|
|
@example
|
|
awk 'NR % 2 == 0' data
|
|
@end example
|
|
|
|
If you use the expression @samp{NR % 2 == 1} instead,
|
|
the program would print the odd-numbered lines.
|
|
@end itemize
|
|
|
|
@node Two Rules
|
|
@section An Example with Two Rules
|
|
@cindex @command{awk} programs
|
|
|
|
The @command{awk} utility reads the input files one line at a
|
|
time. For each line, @command{awk} tries the patterns of each of the rules.
|
|
If several patterns match, then several actions are run in the order in
|
|
which they appear in the @command{awk} program. If no patterns match, then
|
|
no actions are run.
|
|
|
|
After processing all the rules that match the line (and perhaps there are none),
|
|
@command{awk} reads the next line. (However,
|
|
@pxref{Next Statement},
|
|
and also @pxref{Nextfile Statement}).
|
|
This continues until @command{awk} reaches the end of the input.
|
|
For example, the following @command{awk} program contains two rules:
|
|
|
|
@example
|
|
/12/  @{ print $0 @}
/21/  @{ print $0 @}
|
|
@end example
|
|
|
|
@noindent
|
|
The first rule has the string @samp{12} as the
|
|
pattern and @samp{print $0} as the action. The second rule has the
|
|
string @samp{21} as the pattern and also has @samp{print $0} as the
|
|
action. Each rule's action is enclosed in its own pair of braces.
|
|
|
|
This program prints every line that contains the string
|
|
@samp{12} @emph{or} the string @samp{21}. If a line contains both
|
|
strings, it is printed twice, once by each rule.
|
|
|
|
This is what happens if we run this program on our two sample @value{DF}s,
|
|
@file{BBS-list} and @file{inventory-shipped}:
|
|
|
|
@example
|
|
$ awk '/12/ @{ print $0 @}
>      /21/ @{ print $0 @}' BBS-list inventory-shipped
@print{} aardvark     555-5553     1200/300          B
@print{} alpo-net     555-3412     2400/1200/300     A
@print{} barfly       555-7685     1200/300          A
@print{} bites        555-1675     2400/1200/300     A
@print{} core         555-2912     1200/300          C
@print{} fooey        555-1234     2400/1200/300     B
@print{} foot         555-6699     1200/300          B
@print{} macfoo       555-6480     1200/300          A
@print{} sdace        555-3430     2400/1200/300     A
@print{} sabafoo      555-2127     1200/300          C
@print{} sabafoo      555-2127     1200/300          C
@print{} Jan  21  36  64 620
@print{} Apr  21  70  74 514
|
|
@end example
|
|
|
|
@noindent
|
|
Note how the line beginning with @samp{sabafoo}
|
|
in @file{BBS-list} was printed twice, once for each rule.
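
The double printing is easy to reproduce without the sample files. The following is a minimal sketch, assuming a POSIX-compliant shell; the three input lines and the variable name @code{out} are made up for illustration and are not part of the sample data. Only the last input line contains both @samp{12} and @samp{21}, so only that line should be printed twice:

```shell
# Three illustrative input lines; only "gamma 2127" contains both strings.
out=$(printf '%s\n' 'alpha 12' 'beta 21' 'gamma 2127' |
awk '/12/ { print $0 }
     /21/ { print $0 }')

# Show what the two rules printed, in order.
printf '%s\n' "$out"
```

Each rule is tested independently against every line, which is why a line matching both patterns appears once per rule.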
|
|
|
|
@node More Complex
|
|
@section A More Complex Example
|
|
|
|
Now that we've mastered some simple tasks, let's look at
|
|
what typical @command{awk}
|
|
programs do. This example shows how @command{awk} can be used to
|
|
summarize, select, and rearrange the output of another utility. It uses
|
|
features that haven't been covered yet, so don't worry if you don't
|
|
understand all the details:
|
|
|
|
@example
|
|
ls -l | awk '$6 == "Nov" @{ sum += $5 @}
|
|
END @{ print sum @}'
|
|
@end example
|
|
|
|
@cindex @command{csh} utility, backslash continuation and
|
|
@cindex @command{ls} utility
|
|
@cindex backslash (@code{\}), continuing lines and, in @command{csh}
|
|
@cindex @code{\} (backslash), continuing lines and, in @command{csh}
|
|
This command prints the total number of bytes in all the files in the
|
|
current directory that were last modified in November (of any year).
|
|
@footnote{In the C shell (@command{csh}), you need to type
|
|
a semicolon and then a backslash at the end of the first line; see
|
|
@ref{Statements/Lines}, for an
|
|
explanation. In a POSIX-compliant shell, such as the Bourne
|
|
shell or @command{bash}, you can type the example as shown. If the command
|
|
@samp{echo $path} produces an empty output line, you are most likely
|
|
using a POSIX-compliant shell. Otherwise, you are probably using the
|
|
C shell or a shell derived from it.}
|
|
The @w{@samp{ls -l}} part of this example is a system command that gives
|
|
you a listing of the files in a directory, including each file's size and the date
|
|
the file was last modified. Its output looks like this:
|
|
|
|
@example
|
|
-rw-r--r-- 1 arnold user 1933 Nov 7 13:05 Makefile
|
|
-rw-r--r-- 1 arnold user 10809 Nov 7 13:03 awk.h
|
|
-rw-r--r-- 1 arnold user 983 Apr 13 12:14 awk.tab.h
|
|
-rw-r--r-- 1 arnold user 31869 Jun 15 12:20 awk.y
|
|
-rw-r--r-- 1 arnold user 22414 Nov 7 13:03 awk1.c
|
|
-rw-r--r-- 1 arnold user 37455 Nov 7 13:03 awk2.c
|
|
-rw-r--r-- 1 arnold user 27511 Dec 9 13:07 awk3.c
|
|
-rw-r--r-- 1 arnold user 7989 Nov 7 13:03 awk4.c
|
|
@end example
|
|
|
|
@noindent
|
|
@cindex line continuations, with C shell
|
|
The first field contains the file's type and access permissions, the second field contains
|
|
the number of links to the file, and the third field identifies the owner of
|
|
the file. The fourth field identifies the group of the file.
|
|
The fifth field contains the size of the file in bytes. The
|
|
sixth, seventh, and eighth fields contain the month, day, and time,
|
|
respectively, that the file was last modified. Finally, the ninth field
|
|
contains the name of the file.@footnote{On some
|
|
very old systems, you may need to use @samp{ls -lg} to get this output.}
|
|
|
|
@c @cindex automatic initialization
|
|
@cindex initialization, automatic
|
|
The @samp{$6 == "Nov"} in our @command{awk} program is an expression that
|
|
tests whether the sixth field of the output from @w{@samp{ls -l}}
|
|
matches the string @samp{Nov}. Each time a line has the string
|
|
@samp{Nov} for its sixth field, the action @samp{sum += $5} is
|
|
performed. This adds the fifth field (the file's size) to the variable
|
|
@code{sum}. As a result, when @command{awk} has finished reading all the
|
|
input lines, @code{sum} is the total of the sizes of the files whose
|
|
lines matched the pattern. (This works because @command{awk} variables
|
|
are automatically initialized to zero.)
|
|
|
|
After the last line of output from @command{ls} has been processed, the
|
|
@code{END} rule executes and prints the value of @code{sum}.
|
|
In this example, the value of @code{sum} is 80600.
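
The total can be checked without running @command{ls}. The sketch below, which assumes a POSIX-compliant shell, feeds the sample listing shown earlier to the same program through a here-document; the variable name @code{total} is only illustrative:

```shell
# Run the same program on the sample listing instead of live "ls -l" output.
total=$(awk '$6 == "Nov" { sum += $5 }
END { print sum }' <<'EOF'
-rw-r--r-- 1 arnold user 1933 Nov 7 13:05 Makefile
-rw-r--r-- 1 arnold user 10809 Nov 7 13:03 awk.h
-rw-r--r-- 1 arnold user 983 Apr 13 12:14 awk.tab.h
-rw-r--r-- 1 arnold user 31869 Jun 15 12:20 awk.y
-rw-r--r-- 1 arnold user 22414 Nov 7 13:03 awk1.c
-rw-r--r-- 1 arnold user 37455 Nov 7 13:03 awk2.c
-rw-r--r-- 1 arnold user 27511 Dec 9 13:07 awk3.c
-rw-r--r-- 1 arnold user 7989 Nov 7 13:03 awk4.c
EOF
)

# Sum of the five November files: 1933 + 10809 + 22414 + 37455 + 7989.
echo "$total"
```

Only the five lines whose sixth field is @samp{Nov} contribute to the sum; the April, June, and December files are skipped.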
|
|
|
|
These more advanced @command{awk} techniques are covered in later sections
|
|
(@pxref{Action Overview}). Before you can move on to more
|
|
advanced @command{awk} programming, you have to know how @command{awk} interprets
|
|
your input and displays your output. By manipulating fields and using
|
|
@code{print} statements, you can produce some very useful and
|
|
impressive-looking reports.
|
|
|
|
@node Statements/Lines
|
|
@section @command{awk} Statements Versus Lines
|
|
@cindex line breaks
|
|
@cindex newlines
|
|
|
|
Most often, each line in an @command{awk} program is a separate statement or
|
|
separate rule, like this:
|
|
|
|
@example
|
|
awk '/12/ @{ print $0 @}
|
|
/21/ @{ print $0 @}' BBS-list inventory-shipped
|
|
@end example
|
|
|
|
@cindex @command{gawk}, newlines in
|
|
However, @command{gawk} ignores newlines after any of the following
|
|
symbols and keywords:
|
|
|
|
@example
|
|
, @{ ? : || && do else
|
|
@end example
|
|
|
|
@noindent
|
|
A newline at any other point is considered the end of the
|
|
statement.@footnote{The @samp{?} and @samp{:} referred to here are the two parts of the
|
|
three-operand conditional expression described in
|
|
@ref{Conditional Exp}.
|
|
Splitting lines after @samp{?} and @samp{:} is a minor @command{gawk}
|
|
extension; if @option{--posix} is specified
|
|
(@pxref{Options}), then this extension is disabled.}
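
As a small illustration of this rule, the following sketch breaks a pattern after @samp{&&} and a @code{print} statement after a comma, with no backslashes. It assumes a POSIX-compliant shell, and the input lines and the variable name @code{out} are invented for the example. It deliberately avoids breaking after @samp{?} or @samp{:}, because that is the @command{gawk} extension mentioned in the footnote:

```shell
# The newline after "&&" and the newline after "," do not end the statement.
out=$(printf '%s\n' '5 apple' '20 pear' '30 kiwi' |
awk '$1 > 10 &&
     $2 != "kiwi" { print $2,
                          $1 }')

# Only the "20 pear" line satisfies both conditions.
echo "$out"
```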
|
|
|
|
@cindex @code{\} (backslash), continuing lines and
|
|
@cindex backslash (@code{\}), continuing lines and
|
|
If you would like to split a single statement into two lines at a point
|
|
where a newline would terminate it, you can @dfn{continue} it by ending the
|
|
first line with a backslash character (@samp{\}). The backslash must be
|
|
the final character on the line in order to be recognized as a continuation
|
|
character. A backslash is allowed anywhere in the statement, even
|
|
in the middle of a string or regular expression. For example:
|
|
|
|
@example
|
|
awk '/This regular expression is too long, so continue it\
|
|
on the next line/ @{ print $1 @}'
|
|
@end example
|
|
|
|
@noindent
|
|
@cindex portability, backslash continuation and
|
|
We have generally not used backslash continuation in the sample programs
|
|
in this @value{DOCUMENT}. In @command{gawk}, there is no limit on the
|
|
length of a line, so backslash continuation is never strictly necessary;
|
|
it just makes programs more readable. For this same reason, as well as
|
|
for clarity, we have kept most statements short in the sample programs
|
|
presented throughout the @value{DOCUMENT}. Backslash continuation is
|
|
most useful when your @command{awk} program is in a separate source file
|
|
instead of entered from the command line. You should also note that
|
|
many @command{awk} implementations are more particular about where you
|
|
may use backslash continuation. For example, they may not allow you to
|
|
split a string constant using backslash continuation. Thus, for maximum
|
|
portability of your @command{awk} programs, it is best not to split your
|
|
lines in the middle of a regular expression or a string.
|
|
@c 10/2000: gawk, mawk, and current bell labs awk allow it,
|
|
@c solaris 2.7 nawk does not. Solaris /usr/xpg4/bin/awk does though! sigh.
|
|
|
|
@cindex @command{csh} utility
|
|
@cindex backslash (@code{\}), continuing lines and, in @command{csh}
|
|
@cindex @code{\} (backslash), continuing lines and, in @command{csh}
|
|
@strong{Caution:} @emph{Backslash continuation does not work as described
|
|
with the C shell.} It works for @command{awk} programs in files and
|
|
for one-shot programs, @emph{provided} you are using a POSIX-compliant
|
|
shell, such as the Unix Bourne shell or @command{bash}. But the C shell behaves
|
|
differently! There, you must use two backslashes in a row, followed by
|
|
a newline. Note also that when using the C shell, @emph{every} newline
|
|
in your @command{awk} program must be escaped with a backslash. To illustrate:
|
|
|
|
@example
|
|
% awk 'BEGIN @{ \
|
|
? print \\
|
|
? "hello, world" \
|
|
? @}'
|
|
@print{} hello, world
|
|
@end example
|
|
|
|
@noindent
|
|
Here, the @samp{%} and @samp{?} are the C shell's primary and secondary
|
|
prompts, analogous to the standard shell's @samp{$} and @samp{>}.
|
|
|
|
Compare the previous example to how it is done with a POSIX-compliant shell:
|
|
|
|
@example
|
|
$ awk 'BEGIN @{
|
|
> print \
|
|
> "hello, world"
|
|
> @}'
|
|
@print{} hello, world
|
|
@end example
|
|
|
|
@command{awk} is a line-oriented language. Each rule's action has to
|
|
begin on the same line as the pattern. To have the pattern and action
|
|
on separate lines, you @emph{must} use backslash continuation; there
|
|
is no other option.
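
Here is a minimal sketch of that technique, assuming a POSIX-compliant shell (the C shell needs the extra escaping described earlier). The input lines and the variable name @code{out} are illustrative only:

```shell
# The backslash joins the pattern line to the action line, forming one rule.
out=$(printf '%s\n' 'foo 1' 'bar 2' |
awk '/foo/ \
{ print $2 }')

# Only the second field of the line containing "foo" is printed.
echo "$out"
```

Without the backslash, @command{awk} would see two separate rules: a pattern with no action, which prints every matching record in full, and an action with no pattern, which runs for every record.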
|
|
|
|
@cindex backslash (@code{\}), continuing lines and, comments and
|
|
@cindex @code{\} (backslash), continuing lines and, comments and
|
|
@cindex commenting, backslash continuation and
|
|
Another thing to keep in mind is that backslash continuation and
|
|
comments do not mix. As soon as @command{awk} sees the @samp{#} that
|
|
starts a comment, it ignores @emph{everything} on the rest of the
|
|
line. For example:
|
|
|
|
@example
|
|
$ gawk 'BEGIN @{ print "dont panic" # a friendly \
|
|
> BEGIN rule
|
|
> @}'
|
|
@error{} gawk: cmd. line:2: BEGIN rule
|
|
@error{} gawk: cmd. line:2: ^ parse error
|
|
@end example
|
|
|
|
@noindent
|
|
In this case, it looks like the backslash would continue the comment onto the
|
|
next line. However, the backslash-newline combination is never even
|
|
noticed because it is ``hidden'' inside the comment. Thus, the
|
|
@code{BEGIN} is noted as a syntax error.
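
One way to avoid the problem, sketched here under the assumption of a POSIX-compliant shell, is to keep the whole comment on one line and let the newline after it end the statement. The variable name @code{out} is only illustrative:

```shell
# The comment runs to the end of its line; no continuation is attempted.
out=$(awk 'BEGIN { print "dont panic"   # a friendly BEGIN rule
}')

echo "$out"
```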
|
|
|
|
@cindex statements, multiple
|
|
@cindex @code{;} (semicolon)
|
|
@cindex semicolon (@code{;})
|
|
When @command{awk} statements within one rule are short, you might want to put
|
|
more than one of them on a line. This is accomplished by separating the statements
|
|
with a semicolon (@samp{;}).
|
|
This also applies to the rules themselves.
|
|
Thus, the program shown at the start of this @value{SECTION}
|
|
could also be written this way:
|
|
|
|
@example
|
|
/12/ @{ print $0 @} ; /21/ @{ print $0 @}
|
|
@end example
|
|
|
|
@noindent
|
|
@strong{Note:} The requirement that rules on the same line must be
|
|
separated with a semicolon was not in the original @command{awk}
|
|
language; it was added for consistency with the treatment of statements
|
|
within an action.
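
The same separator works inside an action. The following sketch, which assumes a POSIX-compliant shell and uses invented input lines and variable names (@code{n}, @code{last}, @code{out}), puts two statements in each action and separates the two rules with a semicolon as well:

```shell
# Semicolons separate statements within each action and the two rules themselves.
out=$(printf '%s\n' 'a b' 'c d' |
awk '{ n++; last = $2 } ; END { print n; print last }')

# Prints the number of records, then the second field of the last record.
printf '%s\n' "$out"
```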
|
|
|
|
@node Other Features
|
|
@section Other Features of @command{awk}
|
|
|
|
@cindex variables
|
|
The @command{awk} language provides a number of predefined, or
|
|
@dfn{built-in}, variables that your programs can use to get information
|
|
from @command{awk}. There are other variables your program can set
|
|
as well to control how @command{awk} processes your data.
|
|
|
|
In addition, @command{awk} provides a number of built-in functions for doing
|
|
common computational and string-related operations.
|
|
@command{gawk} provides built-in functions for working with timestamps,
|
|
performing bit manipulation, and for runtime string translation.
|
|
|
|
As we develop our presentation of the @command{awk} language, we introduce
|
|
most of the variables and many of the functions. They are defined
|
|
systematically in @ref{Built-in Variables}, and
|
|
@ref{Built-in}.
|
|
|
|
@node When
|
|
@section When to Use @command{awk}
|
|
|
|
@cindex @command{awk}, uses for
|
|
Now that you've seen some of what @command{awk} can do,
|
|
you might wonder how @command{awk} could be useful for you. By using
|
|
utility programs, advanced patterns, field separators, arithmetic
|
|
statements, and other selection criteria, you can produce much more
|
|
complex output. The @command{awk} language is very useful for producing
|
|
reports from large amounts of raw data, such as summarizing information
|
|
from the output of other utility programs like @command{ls}.
|
|
(@xref{More Complex}.)
|
|
|
|
Programs written with @command{awk} are usually much smaller than they would
|
|
be in other languages. This makes @command{awk} programs easy to compose and
|
|
use. Often, @command{awk} programs can be quickly composed at your terminal,
|
|
used once, and thrown away. Because @command{awk} programs are interpreted, you
|
|
can avoid the (usually lengthy) compilation part of the typical
|
|
edit-compile-test-debug cycle of software development.
|
|
|
|
Complex programs have been written in @command{awk}, including a complete
|
|
retargetable assembler for eight-bit microprocessors (@pxref{Glossary}, for
|
|
more information), and a microcode assembler for a special-purpose Prolog
|
|
computer. However, @command{awk}'s capabilities are strained by tasks of
|
|
such complexity.
|
|
|
|
@cindex @command{awk} programs, complex
|
|
If you find yourself writing @command{awk} scripts of more than, say, a few
|
|
hundred lines, you might consider using a different programming
|
|
language. Emacs Lisp is a good choice if you need sophisticated string
|
|
or pattern matching capabilities. The shell is also good at string and
|
|
pattern matching; in addition, it allows powerful use of the system
|
|
utilities. More conventional languages, such as C, C++, and Java, offer
|
|
better facilities for system programming and for managing the complexity
|
|
of large programs. Programs in these languages may require more lines
|
|
of source code than the equivalent @command{awk} programs, but they are
|
|
easier to maintain and usually run more efficiently.
|
|
|
|
@node Regexp
|
|
@chapter Regular Expressions
|
|
@cindex regexp, See regular expressions
|
|
@c STARTOFRANGE regexp
|
|
@cindex regular expressions
|
|
|
|
A @dfn{regular expression}, or @dfn{regexp}, is a way of describing a
|
|
set of strings.
|
|
Because regular expressions are such a fundamental part of @command{awk}
|
|
programming, their format and use deserve a separate @value{CHAPTER}.
|
|
|
|
@cindex forward slash (@code{/})
|
|
@cindex @code{/} (forward slash)
|
|
A regular expression enclosed in slashes (@samp{/})
|
|
is an @command{awk} pattern that matches every input record whose text
|
|
belongs to that set.
|
|
The simplest regular expression is a sequence of letters, numbers, or
|
|
both. Such a regexp matches any string that contains that sequence.
|
|
Thus, the regexp @samp{foo} matches any string containing @samp{foo}.
|
|
Therefore, the pattern @code{/foo/} matches any input record containing
|
|
the three characters @samp{foo} @emph{anywhere} in the record. Other
|
|
kinds of regexps let you specify more complicated classes of strings.
|
|
|
|
@ifnotinfo
|
|
Initially, the examples in this @value{CHAPTER} are simple.
|
|
As we explain more about how
|
|
regular expressions work, we will present more complicated instances.
|
|
@end ifnotinfo
|
|
|
|
@menu
|
|
* Regexp Usage:: How to Use Regular Expressions.
|
|
* Escape Sequences:: How to write nonprinting characters.
|
|
* Regexp Operators:: Regular Expression Operators.
|
|
* Character Lists:: What can go between @samp{[...]}.
|
|
* GNU Regexp Operators:: Operators specific to GNU software.
|
|
* Case-sensitivity:: How to do case-insensitive matching.
|
|
* Leftmost Longest:: How much text matches.
|
|
* Computed Regexps:: Using Dynamic Regexps.
|
|
* Locales:: How the locale affects things.
|
|
@end menu
|
|
|
|
@node Regexp Usage
|
|
@section How to Use Regular Expressions
|
|
|
|
@cindex regular expressions, as patterns
|
|
A regular expression can be used as a pattern by enclosing it in
|
|
slashes. Then the regular expression is tested against the
|
|
entire text of each record. (Normally, it only needs
|
|
to match some part of the text in order to succeed.) For example, the
|
|
following prints the second field of each record that contains the string
|
|
@samp{foo} anywhere in it:
|
|
|
|
@example
|
|
$ awk '/foo/ @{ print $2 @}' BBS-list
|
|
@print{} 555-1234
|
|
@print{} 555-6699
|
|
@print{} 555-6480
|
|
@print{} 555-2127
|
|
@end example
|
|
|
|
@cindex regular expressions, operators
|
|
@cindex operators, string-matching
|
|
@c @cindex operators, @code{~}
|
|
@cindex string-matching operators
|
|
@cindex @code{~} (tilde), @code{~} operator
|
|
@cindex tilde (@code{~}), @code{~} operator
|
|
@cindex @code{!} (exclamation point), @code{!~} operator
|
|
@cindex exclamation point (@code{!}), @code{!~} operator
|
|
@c @cindex operators, @code{!~}
|
|
@cindex @code{if} statement
|
|
@cindex @code{while} statement
|
|
@cindex @code{do}-@code{while} statement
|
|
@c @cindex statements, @code{if}
|
|
@c @cindex statements, @code{while}
|
|
@c @cindex statements, @code{do}
|
|
Regular expressions can also be used in matching expressions. These
|
|
expressions allow you to specify the string to match against; it need
|
|
not be the entire current input record. The two operators @samp{~}
|
|
and @samp{!~} perform regular expression comparisons. Expressions
|
|
using these operators can be used as patterns, or in @code{if},
|
|
@code{while}, @code{for}, and @code{do} statements.
|
|
(@xref{Statements}.)
|
|
For example:
|
|
|
|
@example
|
|
@var{exp} ~ /@var{regexp}/
|
|
@end example
|
|
|
|
@noindent
|
|
is true if the expression @var{exp} (taken as a string)
|
|
matches @var{regexp}. The following example matches, or selects,
|
|
all input records with the uppercase letter @samp{J} somewhere in the
|
|
first field:
|
|
|
|
@example
|
|
$ awk '$1 ~ /J/' inventory-shipped
|
|
@print{} Jan 13 25 15 115
|
|
@print{} Jun 31 42 75 492
|
|
@print{} Jul 24 34 67 436
|
|
@print{} Jan 21 36 64 620
|
|
@end example
|
|
|
|
So does this:
|
|
|
|
@example
|
|
awk '@{ if ($1 ~ /J/) print @}' inventory-shipped
|
|
@end example
|
|
|
|
This next example is true if the expression @var{exp}
|
|
(taken as a character string)
|
|
does @emph{not} match @var{regexp}:
|
|
|
|
@example
|
|
@var{exp} !~ /@var{regexp}/
|
|
@end example
|
|
|
|
The following example matches,
|
|
or selects, all input records whose first field @emph{does not} contain
|
|
the uppercase letter @samp{J}:
|
|
|
|
@example
|
|
$ awk '$1 !~ /J/' inventory-shipped
|
|
@print{} Feb 15 32 24 226
|
|
@print{} Mar 15 24 34 228
|
|
@print{} Apr 31 52 63 420
|
|
@print{} May 16 34 29 208
|
|
@dots{}
|
|
@end example
|
|
|
|
@cindex regexp constants
|
|
@cindex regular expressions, constants, See regexp constants
|
|
When a regexp is enclosed in slashes, such as @code{/foo/}, we call it
|
|
a @dfn{regexp constant}, much like @code{5.27} is a numeric constant and
|
|
@code{"foo"} is a string constant.
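
The two matching operators can be tried side by side without the sample files. This is a minimal sketch, assuming a POSIX-compliant shell; the three input lines and the names @code{with_j}, @code{without_j}, and @code{out} are made up for illustration:

```shell
# Count records whose first field does, and does not, contain an uppercase J.
out=$(printf '%s\n' 'Jan 13' 'Feb 15' 'Jun 31' |
awk '$1 ~ /J/  { with_j++ }
     $1 !~ /J/ { without_j++ }
     END       { print with_j, without_j }')

# "Jan" and "Jun" match; "Feb" does not.
echo "$out"
```

Because every record satisfies exactly one of the two patterns, the two counts always add up to the number of input records.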
|
|
|
|
@node Escape Sequences
|
|
@section Escape Sequences
|
|
|
|
@cindex escape sequences
|
|
@cindex backslash (@code{\}), in escape sequences
|
|
@cindex @code{\} (backslash), in escape sequences
|
|
Some characters cannot be included literally in string constants
|
|
(@code{"foo"}) or regexp constants (@code{/foo/}).
|
|
Instead, they should be represented with @dfn{escape sequences},
|
|
which are character sequences beginning with a backslash (@samp{\}).
|
|
One use of an escape sequence is to include a double-quote character in
|
|
a string constant. Because a plain double quote ends the string, you
|
|
must use @samp{\"} to represent an actual double-quote character as a
|
|
part of the string. For example:
|
|
|
|
@example
|
|
$ awk 'BEGIN @{ print "He said \"hi!\" to her." @}'
|
|
@print{} He said "hi!" to her.
|
|
@end example
|
|
|
|
The backslash character itself is another character that cannot be
|
|
included normally; you must write @samp{\\} to put one backslash in the
|
|
string or regexp. Thus, the string whose contents are the two characters
|
|
@samp{"} and @samp{\} must be written @code{"\"\\"}.
|
|
|
|
Backslash also represents unprintable characters
|
|
such as TAB or newline. While there is nothing to stop you from entering most
|
|
unprintable characters directly in a string constant or regexp constant,
|
|
they may look ugly.
|
|
|
|
The following table lists
|
|
all the escape sequences used in @command{awk} and
|
|
what they represent. Unless noted otherwise, all these escape
|
|
sequences apply to both string constants and regexp constants:
|
|
|
|
@table @code
|
|
@item \\
|
|
A literal backslash, @samp{\}.
|
|
|
|
@c @cindex @command{awk} language, V.4 version
|
|
@cindex @code{\} (backslash), @code{\a} escape sequence
|
|
@cindex backslash (@code{\}), @code{\a} escape sequence
|
|
@item \a
|
|
The ``alert'' character, @kbd{@value{CTL}-g}, ASCII code 7 (BEL).
|
|
(This usually makes some sort of audible noise.)
|
|
|
|
@cindex @code{\} (backslash), @code{\b} escape sequence
|
|
@cindex backslash (@code{\}), @code{\b} escape sequence
|
|
@item \b
|
|
Backspace, @kbd{@value{CTL}-h}, ASCII code 8 (BS).
|
|
|
|
@cindex @code{\} (backslash), @code{\f} escape sequence
|
|
@cindex backslash (@code{\}), @code{\f} escape sequence
|
|
@item \f
|
|
Formfeed, @kbd{@value{CTL}-l}, ASCII code 12 (FF).
|
|
|
|
@cindex @code{\} (backslash), @code{\n} escape sequence
|
|
@cindex backslash (@code{\}), @code{\n} escape sequence
|
|
@item \n
|
|
Newline, @kbd{@value{CTL}-j}, ASCII code 10 (LF).
|
|
|
|
@cindex @code{\} (backslash), @code{\r} escape sequence
|
|
@cindex backslash (@code{\}), @code{\r} escape sequence
|
|
@item \r
|
|
Carriage return, @kbd{@value{CTL}-m}, ASCII code 13 (CR).
|
|
|
|
@cindex @code{\} (backslash), @code{\t} escape sequence
|
|
@cindex backslash (@code{\}), @code{\t} escape sequence
|
|
@item \t
|
|
Horizontal TAB, @kbd{@value{CTL}-i}, ASCII code 9 (HT).
|
|
|
|
@c @cindex @command{awk} language, V.4 version
|
|
@cindex @code{\} (backslash), @code{\v} escape sequence
|
|
@cindex backslash (@code{\}), @code{\v} escape sequence
|
|
@item \v
|
|
Vertical tab, @kbd{@value{CTL}-k}, ASCII code 11 (VT).
|
|
|
|
@cindex @code{\} (backslash), @code{\}@var{nnn} escape sequence
|
|
@cindex backslash (@code{\}), @code{\}@var{nnn} escape sequence
|
|
@item \@var{nnn}
|
|
The octal value @var{nnn}, where @var{nnn} stands for 1 to 3 digits
|
|
between @samp{0} and @samp{7}. For example, the code for the ASCII ESC
|
|
(escape) character is @samp{\033}.
|
|
|
|
@c @cindex @command{awk} language, V.4 version
|
|
@c @cindex @command{awk} language, POSIX version
|
|
@cindex @code{\} (backslash), @code{\x} escape sequence
|
|
@cindex backslash (@code{\}), @code{\x} escape sequence
|
|
@item \x@var{hh}@dots{}
|
|
The hexadecimal value @var{hh}, where @var{hh} stands for a sequence
|
|
of hexadecimal digits (@samp{0}--@samp{9}, and either @samp{A}--@samp{F}
|
|
or @samp{a}--@samp{f}). Like the same construct
|
|
in ISO C, the escape sequence continues until the first nonhexadecimal
|
|
digit is seen. However, using more than two hexadecimal digits produces
|
|
undefined results. (The @samp{\x} escape sequence is not allowed in
|
|
POSIX @command{awk}.)
|
|
|
|
@cindex @code{\} (backslash), @code{\/} escape sequence
|
|
@cindex backslash (@code{\}), @code{\/} escape sequence
|
|
@item \/
|
|
A literal slash (necessary for regexp constants only).
|
|
This expression is used when you want to write a regexp
|
|
constant that contains a slash. Because the regexp is delimited by
|
|
slashes, you need to escape the slash that is part of the pattern,
|
|
in order to tell @command{awk} to keep processing the rest of the regexp.
|
|
|
|
@cindex @code{\} (backslash), @code{\"} escape sequence
|
|
@cindex backslash (@code{\}), @code{\"} escape sequence
|
|
@item \"
|
|
A literal double quote (necessary for string constants only).
|
|
This expression is used when you want to write a string
|
|
constant that contains a double quote. Because the string is delimited by
|
|
double quotes, you need to escape the quote that is part of the string,
|
|
in order to tell @command{awk} to keep processing the rest of the string.
|
|
@end table
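
A few of these sequences can be seen at work in string constants. The sketch below assumes a POSIX-compliant shell and an ASCII-based character set (so that octal 101, 102, and 103 are @samp{A}, @samp{B}, and @samp{C}); the variable name @code{out} is only illustrative. It avoids @samp{\x}, which the table notes is not allowed in POSIX @command{awk}:

```shell
# Line 1: a TAB between A and B.  Line 2: three octal escapes.
# Line 3: an escaped double quote followed by an escaped backslash.
out=$(awk 'BEGIN { print "A\tB"; print "\101\102\103"; print "\"\\" }')

printf '%s\n' "$out"
```

The single quotes around the program keep the shell from interpreting the backslashes, so @command{awk} sees each escape sequence exactly as typed.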
|
|
|
|
In @command{gawk}, a number of additional two-character sequences that begin
|
|
with a backslash have special meaning in regexps.
|
|
@xref{GNU Regexp Operators}.
|
|
|
|
In a regexp, a backslash before any character that is not in the previous list
|
|
and not listed in
|
|
@ref{GNU Regexp Operators},
|
|
means that the next character should be taken literally, even if it would
|
|
normally be a regexp operator. For example, @code{/a\+b/} matches the three
|
|
characters @samp{a+b}.
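
The difference between the escaped and unescaped forms shows up clearly on a small input. This is a sketch assuming a POSIX-compliant shell; the three input lines and the variable names @code{literal} and @code{operator} are invented for the example:

```shell
# With the backslash, "+" is an ordinary character: only the text a+b matches.
literal=$(printf '%s\n' 'a+b' 'aab' 'ab' | awk '/a\+b/')

# Without it, "+" is an operator: one or more "a" followed directly by "b".
operator=$(printf '%s\n' 'a+b' 'aab' 'ab' | awk '/a+b/')

printf 'literal:\n%s\noperator:\n%s\n' "$literal" "$operator"
```

Note that the unescaped regexp does not match the line @samp{a+b} at all, because in that line the @samp{a} is followed by a plus sign rather than by a @samp{b}.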
|
|
|
|
@cindex backslash (@code{\}), in escape sequences
|
|
@cindex @code{\} (backslash), in escape sequences
|
|
@cindex portability
|
|
For complete portability, do not use a backslash before any character not
|
|
shown in the previous list.
|
|
|
|
To summarize:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
The escape sequences in the table above are always processed first,
|
|
for both string constants and regexp constants. This happens very early,
|
|
as soon as @command{awk} reads your program.
|
|
|
|
@item
|
|
@command{gawk} processes both regexp constants and dynamic regexps
|
|
(@pxref{Computed Regexps}),
|
|
for the special operators listed in
|
|
@ref{GNU Regexp Operators}.
|
|
|
|
@item
|
|
A backslash before any other character means to treat that character
|
|
literally.
|
|
@end itemize
|
|
|
|
@c fakenode --- for prepinfo
|
|
@subheading Advanced Notes: Backslash Before Regular Characters
|
|
@cindex portability, backslash in escape sequences
|
|
@cindex POSIX @command{awk}, backslashes in string constants
|
|
@cindex backslash (@code{\}), in escape sequences, POSIX and
|
|
@cindex @code{\} (backslash), in escape sequences, POSIX and
|
|
|
|
@cindex troubleshooting, backslash before nonspecial character
|
|
If you place a backslash in a string constant before something that is
|
|
not one of the characters previously listed, POSIX @command{awk} purposely
|
|
leaves what happens as undefined. There are two choices:
|
|
|
|
@c @cindex automatic warnings
|
|
@c @cindex warnings, automatic
|
|
@table @asis
|
|
@item Strip the backslash out
|
|
This is what Unix @command{awk} and @command{gawk} both do.
|
|
For example, @code{"a\qc"} is the same as @code{"aqc"}.
|
|
(Because this is such an easy bug both to introduce and to miss,
|
|
@command{gawk} warns you about it.)
|
|
Consider @samp{FS = @w{"[ \t]+\|[ \t]+"}} to use vertical bars
|
|
surrounded by whitespace as the field separator. There should be
|
|
two backslashes in the string: @samp{FS = @w{"[ \t]+\\|[ \t]+"}}.
|
|
@c I did this! This is why I added the warning.
|
|
|
|
@cindex @command{gawk}, escape sequences
|
|
@cindex Unix @command{awk}, backslashes in escape sequences
|
|
@item Leave the backslash alone
|
|
Some other @command{awk} implementations do this.
|
|
In such implementations, typing @code{"a\qc"} is the same as typing
|
|
@code{"a\\qc"}.
|
|
@end table
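
The two-backslash form of the field separator can be tried as follows. This is a minimal sketch, assuming a POSIX-compliant shell; the input line (two spaces, a vertical bar, and a TAB between the words) and the variable name @code{out} are illustrative only:

```shell
# FS is a string constant, so "\\|" reaches the regexp matcher as "\|",
# which stands for a literal vertical bar.
out=$(printf 'left  |\tright\n' |
awk 'BEGIN { FS = "[ \t]+\\|[ \t]+" } { print $2 }')

# The bar and its surrounding whitespace are consumed as the separator.
echo "$out"
```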
|
|
|
|
@c fakenode --- for prepinfo
|
|
@subheading Advanced Notes: Escape Sequences for Metacharacters
|
|
@cindex metacharacters, escape sequences for
|
|
|
|
Suppose you use an octal or hexadecimal
|
|
escape to represent a regexp metacharacter.
|
|
(See @ref{Regexp Operators}.)
|
|
Does @command{awk} treat the character as a literal character or as a regexp
|
|
operator?
|
|
|
|
@cindex dark corner, escape sequences, for metacharacters
|
|
Historically, such characters were taken literally.
|
|
@value{DARKCORNER}
|
|
However, the POSIX standard indicates that they should be treated
|
|
as real metacharacters, which is what @command{gawk} does.
|
|
In compatibility mode (@pxref{Options}),
|
|
@command{gawk} treats the characters represented by octal and hexadecimal
|
|
escape sequences literally when used in regexp constants. Thus,
|
|
@code{/a\52b/} is equivalent to @code{/a\*b/}.
|
|
|
|
@node Regexp Operators
|
|
@section Regular Expression Operators
|
|
@c STARTOFRANGE regexpo
|
|
@cindex regular expressions, operators
|
|
|
|
You can combine regular expressions with special characters,
|
|
called @dfn{regular expression operators} or @dfn{metacharacters}, to
|
|
increase the power and versatility of regular expressions.
|
|
|
|
The escape sequences described
|
|
@ifnotinfo
|
|
earlier
|
|
@end ifnotinfo
|
|
in @ref{Escape Sequences},
|
|
are valid inside a regexp. They are introduced by a @samp{\} and
|
|
are recognized and converted into corresponding real characters as
|
|
the very first step in processing regexps.
|
|
|
|
Here is a list of metacharacters. All characters that are not escape
|
|
sequences and that are not listed in the table stand for themselves:
|
|
|
|
@table @code
|
|
@cindex backslash (@code{\})
|
|
@cindex @code{\} (backslash)
|
|
@item \
|
|
This is used to suppress the special meaning of a character when
|
|
matching. For example, @samp{\$}
|
|
matches the character @samp{$}.
|
|
|
|
@cindex regular expressions, anchors in
|
|
@cindex Texinfo, chapter beginnings in files
|
|
@cindex @code{^} (caret)
|
|
@cindex caret (@code{^})
|
|
@item ^
|
|
This matches the beginning of a string. For example, @samp{^@@chapter}
|
|
matches @samp{@@chapter} at the beginning of a string and can be used
|
|
to identify chapter beginnings in Texinfo source files.
|
|
The @samp{^} is known as an @dfn{anchor}, because it anchors the pattern to
|
|
match only at the beginning of the string.
|
|
|
|
It is important to realize that @samp{^} does not match the beginning of
|
|
a line embedded in a string.
|
|
The condition is not true in the following example:
|
|
|
|
@example
|
|
if ("line1\nLINE 2" ~ /^L/) @dots{}
|
|
@end example
|
|
|
|
@cindex @code{$} (dollar sign)
|
|
@cindex dollar sign (@code{$})
|
|
@item $
|
|
This is similar to @samp{^}, but it matches only at the end of a string.
|
|
For example, @samp{p$}
|
|
matches a record that ends with a @samp{p}. The @samp{$} is an anchor
|
|
and does not match the end of a line embedded in a string.
|
|
The condition in the following example is not true:
|
|
|
|
@example
|
|
if ("line1\nLINE 2" ~ /1$/) @dots{}
|
|
@end example
|
|
|
|
@cindex @code{.} (period)
|
|
@cindex period (@code{.})
|
|
@item .
|
|
This matches any single character,
|
|
@emph{including} the newline character. For example, @samp{.P}
|
|
matches any single character followed by a @samp{P} in a string. Using
|
|
concatenation, we can make a regular expression such as @samp{U.A}, which
|
|
matches any three-character sequence that begins with @samp{U} and ends
|
|
with @samp{A}.
|
|
|
|
@c comma before using does NOT do tertiary
|
|
@cindex POSIX @command{awk}, period (@code{.}), using
|
|
In strict POSIX mode (@pxref{Options}),
|
|
@samp{.} does not match the @sc{nul}
|
|
character, which is a character with all bits equal to zero.
|
|
Otherwise, @sc{nul} is just another character. Other versions of @command{awk}
|
|
may not be able to match the @sc{nul} character.
|
|
|
|
@cindex @code{[]} (square brackets)
|
|
@cindex square brackets (@code{[]})
|
|
@cindex character lists
|
|
@cindex character sets, See Also character lists
|
|
@cindex bracket expressions, See character lists
|
|
@item [@dots{}]
|
|
This is called a @dfn{character list}.@footnote{In other literature,
|
|
you may see a character list referred to as either a
|
|
@dfn{character set}, a @dfn{character class}, or a @dfn{bracket expression}.}
|
|
It matches any @emph{one} of the characters that are enclosed in
|
|
the square brackets. For example, @samp{[MVX]} matches any one of
|
|
the characters @samp{M}, @samp{V}, or @samp{X} in a string. A full
|
|
discussion of what can be inside the square brackets of a character list
|
|
is given in
|
|
@ref{Character Lists}.
|
|
|
|
@cindex character lists, complemented
|
|
@item [^ @dots{}]
|
|
This is a @dfn{complemented character list}. The first character after
|
|
the @samp{[} @emph{must} be a @samp{^}. It matches any single character
|
|
@emph{except} those in the square brackets. For example, @samp{[^awk]}
|
|
matches any character that is not an @samp{a}, @samp{w},
|
|
or @samp{k}.
|
|
|
|
@cindex @code{|} (vertical bar)
|
|
@cindex vertical bar (@code{|})
|
|
@item |
|
|
This is the @dfn{alternation operator} and it is used to specify
|
|
alternatives.
|
|
The @samp{|} has the lowest precedence of all the regular
|
|
expression operators.
|
|
For example, @samp{^P|[[:digit:]]}
|
|
matches any string that matches either @samp{^P} or @samp{[[:digit:]]}. This
|
|
means it matches any string that starts with @samp{P} or contains a digit.
|
|
|
|
The alternation applies to the largest possible regexps on either side.
|
|
|
|
@cindex @code{()} (parentheses)
|
|
@cindex parentheses (@code{()})
|
|
@item (@dots{})
|
|
Parentheses are used for grouping in regular expressions, as in
|
|
arithmetic. They can be used to concatenate regular expressions
|
|
containing the alternation operator, @samp{|}. For example,
|
|
@samp{@@(samp|code)\@{[^@}]+\@}} matches both @samp{@@code@{foo@}} and
|
|
@samp{@@samp@{bar@}}.
|
|
(These are Texinfo formatting control sequences. The @samp{+} is
|
|
explained further on in this list.)
|
|
|
|
@cindex @code{*} (asterisk), @code{*} operator, as regexp operator
|
|
@cindex asterisk (@code{*}), @code{*} operator, as regexp operator
|
|
@item *
|
|
This symbol means that the preceding regular expression should be
|
|
repeated as many times as necessary to find a match. For example, @samp{ph*}
|
|
applies the @samp{*} symbol to the preceding @samp{h} and looks for matches
|
|
of one @samp{p} followed by any number of @samp{h}s. This also matches
|
|
just @samp{p} if no @samp{h}s are present.
|
|
|
|
The @samp{*} repeats the @emph{smallest} possible preceding expression.
|
|
(Use parentheses if you want to repeat a larger expression.) It finds
|
|
as many repetitions as possible. For example,
|
|
@samp{awk '/\(c[ad][ad]*r x\)/ @{ print @}' sample}
|
|
prints every record in @file{sample} containing a string of the form
|
|
@samp{(car x)}, @samp{(cdr x)}, @samp{(cadr x)}, and so on.
|
|
Notice the escaping of the parentheses by preceding them
|
|
with backslashes.
|
|
|
|
@cindex @code{+} (plus sign)
|
|
@cindex plus sign (@code{+})
|
|
@item +
|
|
This symbol is similar to @samp{*}, except that the preceding expression must be
|
|
matched at least once. This means that @samp{wh+y}
|
|
would match @samp{why} and @samp{whhy}, but not @samp{wy}, whereas
|
|
@samp{wh*y} would match all three of these strings.
|
|
The following is a simpler
|
|
way of writing the last @samp{*} example:
|
|
|
|
@example
|
|
awk '/\(c[ad]+r x\)/ @{ print @}' sample
|
|
@end example
|
|
|
|
@cindex @code{?} (question mark)
|
|
@cindex question mark (@code{?})
|
|
@item ?
|
|
This symbol is similar to @samp{*}, except that the preceding expression can be
|
|
matched either once or not at all. For example, @samp{fe?d}
|
|
matches @samp{fed} and @samp{fd}, but nothing else.
|
|
|
|
@cindex interval expressions
|
|
@item @{@var{n}@}
|
|
@itemx @{@var{n},@}
|
|
@itemx @{@var{n},@var{m}@}
|
|
One or two numbers inside braces denote an @dfn{interval expression}.
|
|
If there is one number in the braces, the preceding regexp is repeated
|
|
@var{n} times.
|
|
If there are two numbers separated by a comma, the preceding regexp is
|
|
repeated @var{n} to @var{m} times.
|
|
If there is one number followed by a comma, then the preceding regexp
|
|
is repeated at least @var{n} times:
|
|
|
|
@table @code
|
|
@item wh@{3@}y
|
|
Matches @samp{whhhy}, but not @samp{why} or @samp{whhhhy}.
|
|
|
|
@item wh@{3,5@}y
|
|
Matches @samp{whhhy}, @samp{whhhhy}, or @samp{whhhhhy}, only.
|
|
|
|
@item wh@{2,@}y
|
|
Matches @samp{whhy} or @samp{whhhy}, and so on.
|
|
@end table
|
|
|
|
@cindex POSIX @command{awk}, interval expressions in
|
|
Interval expressions were not traditionally available in @command{awk}.
|
|
They were added as part of the POSIX standard to make @command{awk}
|
|
and @command{egrep} consistent with each other.
|
|
|
|
@cindex @command{gawk}, interval expressions and
|
|
However, because old programs may use @samp{@{} and @samp{@}} in regexp
|
|
constants, by default @command{gawk} does @emph{not} match interval expressions
|
|
in regexps. If either @option{--posix} or @option{--re-interval} is specified
|
|
(@pxref{Options}), then interval expressions
|
|
are allowed in regexps.
|
|
|
|
For new programs that use @samp{@{} and @samp{@}} in regexp constants,
|
|
it is good practice to always escape them with a backslash. Then the
|
|
regexp constants are valid and work the way you want them to, using
|
|
any version of @command{awk}.@footnote{Use two backslashes if you're
|
|
using a string constant with a regexp operator or function.}
|
|
@end table
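
As a rough illustration of the repetition operators described in the
preceding table, the following shell sketch feeds a few made-up words to
@command{awk} and collects the ones each pattern accepts. The sample
words and the shell variable names are assumptions chosen only for this
illustration. The patterns are anchored with @samp{^} and @samp{$} so
that the whole record has to match:

```shell
# Hypothetical sample input, one word per line.
words='wy
why
whhy
fd
fed
feed'

# "+" needs at least one "h" between "w" and "y".
plus=$(printf '%s\n' "$words" | awk '/^wh+y$/')

# "*" also accepts zero occurrences of "h".
star=$(printf '%s\n' "$words" | awk '/^wh*y$/')

# "?" accepts zero or one "e", so "feed" is rejected.
opt=$(printf '%s\n' "$words" | awk '/^fe?d$/')

printf 'plus:\n%s\nstar:\n%s\nopt:\n%s\n' "$plus" "$star" "$opt"
```

Interval expressions are left out of this sketch on purpose, because
whether they are recognized depends on the @command{awk} version and the
options in effect.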
|
|
|
|
@cindex precedence, regexp operators
|
|
@cindex regular expressions, operators, precedence of
|
|
In regular expressions, the @samp{*}, @samp{+}, and @samp{?} operators,
|
|
as well as the braces @samp{@{} and @samp{@}},
|
|
have
|
|
the highest precedence, followed by concatenation, and finally by @samp{|}.
|
|
As in arithmetic, parentheses can change how operators are grouped.
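
The effect of precedence is easiest to see with @samp{^} and @samp{|}.
The following sketch, which uses invented input lines and shell variable
names, compares @samp{^ab|cd} with @samp{^(ab|cd)}. Because @samp{|}
binds most loosely, the first pattern means ``starts with @samp{ab}, or
contains @samp{cd} anywhere,'' whereas the parentheses in the second
make the anchor apply to both alternatives:

```shell
# Hypothetical sample input, one string per line.
lines='abx
xcd
cdx
xab'

# Without parentheses: (^ab) or (cd anywhere in the record).
loose=$(printf '%s\n' "$lines" | awk '/^ab|cd/')

# With parentheses: the record must start with "ab" or with "cd".
anchored=$(printf '%s\n' "$lines" | awk '/^(ab|cd)/')

printf 'loose:\n%s\nanchored:\n%s\n' "$loose" "$anchored"
```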
|
|
|
|
@cindex POSIX @command{awk}, regular expressions and
|
|
@cindex @command{gawk}, regular expressions, precedence
|
|
In POSIX @command{awk} and @command{gawk}, the @samp{*}, @samp{+}, and @samp{?} operators
|
|
stand for themselves when there is nothing in the regexp that precedes them.
|
|
For example, @samp{/+/} matches a literal plus sign. However, many other versions of
|
|
@command{awk} treat such a usage as a syntax error.
|
|
|
|
If @command{gawk} is in compatibility mode
|
|
(@pxref{Options}),
|
|
POSIX character classes and interval expressions are not available in
|
|
regular expressions.
|
|
@c ENDOFRANGE regexpo
|
|
|
|
@node Character Lists
|
|
@section Using Character Lists
|
|
@c STARTOFRANGE charlist
|
|
@cindex character lists
|
|
@cindex character lists, range expressions
|
|
@cindex range expressions
|
|
|
|
Within a character list, a @dfn{range expression} consists of two
|
|
characters separated by a hyphen. It matches any single character that
|
|
sorts between the two characters, using the locale's
|
|
collating sequence and character set. For example, in the default C
|
|
locale, @samp{[a-dx-z]} is equivalent to @samp{[abcdxyz]}. Many locales
|
|
sort characters in dictionary order, and in these locales,
|
|
@samp{[a-dx-z]} is typically not equivalent to @samp{[abcdxyz]}; instead it
|
|
might be equivalent to @samp{[aBbCcDdxXyYz]}, for example. To obtain
|
|
the traditional interpretation of bracket expressions, you can use the C
|
|
locale by setting the @env{LC_ALL} environment variable to the value
|
|
@samp{C}.
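
A minimal sketch of the second approach follows. It sets @env{LC_ALL}
for a single @command{awk} invocation, so that @samp{[a-dx-z]} is
interpreted as exactly the characters @samp{abcdxyz}. The input letters
and the variable name @code{matched} are made up for the illustration:

```shell
# Force the C locale for this one command only; the range then covers
# exactly a, b, c, d, x, y, and z.
matched=$(printf 'a\nB\nc\nm\nx\nZ\n' | LC_ALL=C awk '/^[a-dx-z]$/')

echo "$matched"
```

Without the @samp{LC_ALL=C} prefix, the result may differ from system to
system, which is the point of the preceding paragraph.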
|
|
|
|
@cindex @code{\} (backslash), in character lists
|
|
@cindex backslash (@code{\}), in character lists
|
|
@cindex @code{^} (caret), in character lists
|
|
@cindex caret (@code{^}), in character lists
|
|
@cindex @code{-} (hyphen), in character lists
|
|
@cindex hyphen (@code{-}), in character lists
|
|
To include one of the characters @samp{\}, @samp{]}, @samp{-}, or @samp{^} in a
|
|
character list, put a @samp{\} in front of it. For example:
|
|
|
|
@example
|
|
[d\]]
|
|
@end example
|
|
|
|
@noindent
|
|
matches either @samp{d} or @samp{]}.
|
|
|
|
@cindex POSIX @command{awk}, character lists and
|
|
@cindex Extended Regular Expressions (EREs)
|
|
@cindex EREs (Extended Regular Expressions)
|
|
@cindex @command{egrep} utility
|
|
This treatment of @samp{\} in character lists
|
|
is compatible with other @command{awk}
|
|
implementations and is also mandated by POSIX.
|
|
The regular expressions in @command{awk} are a superset
|
|
of the POSIX specification for Extended Regular Expressions (EREs).
|
|
POSIX EREs are based on the regular expressions accepted by the
|
|
traditional @command{egrep} utility.
|
|
|
|
@cindex character lists, character classes
|
|
@cindex POSIX @command{awk}, character lists and, character classes
|
|
@dfn{Character classes} are a new feature introduced in the POSIX standard.
|
|
A character class is a special notation for describing
|
|
lists of characters that have a specific attribute, but the
|
|
actual characters can vary from country to country and/or
|
|
from character set to character set. For example, the notion of what
|
|
is an alphabetic character differs between the United States and France.
|
|
|
|
A character class is only valid in a regexp @emph{inside} the
|
|
brackets of a character list. Character classes consist of @samp{[:},
|
|
a keyword denoting the class, and @samp{:]}. Here are the character
|
|
classes defined by the POSIX standard.
|
|
|
|
@c the regular table is commented out while trying out the multitable.
|
|
@c leave it here in case we need to go back, but make sure the text
|
|
@c still corresponds!
|
|
|
|
@ignore
|
|
@table @code
|
|
@item [:alnum:]
|
|
Alphanumeric characters.
|
|
|
|
@item [:alpha:]
|
|
Alphabetic characters.
|
|
|
|
@item [:blank:]
|
|
Space and TAB characters.
|
|
|
|
@item [:cntrl:]
|
|
Control characters.
|
|
|
|
@item [:digit:]
|
|
Numeric characters.
|
|
|
|
@item [:graph:]
|
|
Characters that are printable and visible.
|
|
(A space is printable but not visible, whereas an @samp{a} is both.)
|
|
|
|
@item [:lower:]
|
|
Lowercase alphabetic characters.
|
|
|
|
@item [:print:]
|
|
Printable characters (characters that are not control characters).
|
|
|
|
@item [:punct:]
|
|
Punctuation characters (characters that are not letters, digits,
|
|
control characters, or space characters).
|
|
|
|
@item [:space:]
|
|
Space characters (such as space, TAB, and formfeed, to name a few).
|
|
|
|
@item [:upper:]
|
|
Uppercase alphabetic characters.
|
|
|
|
@item [:xdigit:]
|
|
Characters that are hexadecimal digits.
|
|
@end table
|
|
@end ignore
|
|
|
|
@multitable {@code{[:xdigit:]}} {Characters that are both printable and visible. (A space is}
|
|
@item @code{[:alnum:]} @tab Alphanumeric characters.
|
|
@item @code{[:alpha:]} @tab Alphabetic characters.
|
|
@item @code{[:blank:]} @tab Space and TAB characters.
|
|
@item @code{[:cntrl:]} @tab Control characters.
|
|
@item @code{[:digit:]} @tab Numeric characters.
|
|
@item @code{[:graph:]} @tab Characters that are both printable and visible.
|
|
(A space is printable but not visible, whereas an @samp{a} is both.)
|
|
@item @code{[:lower:]} @tab Lowercase alphabetic characters.
|
|
@item @code{[:print:]} @tab Printable characters (characters that are not control characters).
|
|
@item @code{[:punct:]} @tab Punctuation characters (characters that are not letters, digits,
|
|
control characters, or space characters).
|
|
@item @code{[:space:]} @tab Space characters (such as space, TAB, and formfeed, to name a few).
|
|
@item @code{[:upper:]} @tab Uppercase alphabetic characters.
|
|
@item @code{[:xdigit:]} @tab Characters that are hexadecimal digits.
|
|
@end multitable
|
|
|
|
For example, before the POSIX standard, you had to write @code{/[A-Za-z0-9]/}
|
|
to match alphanumeric characters. If your
|
|
character set had other alphabetic characters in it, this would not
|
|
match them, and if your character set collated differently from
|
|
ASCII, this might not even match the ASCII alphanumeric characters.
|
|
With the POSIX character classes, you can write
|
|
@code{/[[:alnum:]]/} to match the alphabetic
|
|
and numeric characters in your character set.
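
As a small sketch of the two spellings, the following commands count the
alphanumeric characters in a made-up, ASCII-only string. It relies on
@code{gsub} returning the number of substitutions it made, and on
@samp{&} in the replacement standing for the matched text, so the record
itself is left unchanged. The sample string and the variable names are
assumptions, and the sketch assumes an @command{awk} that recognizes
POSIX character classes:

```shell
line='a1-b2_c3!'

# Count matches of the POSIX class; "&" puts each match back unchanged.
classes=$(echo "$line" | awk '{ print gsub(/[[:alnum:]]/, "&") }')

# The traditional range spelling, pinned to the C locale.
ranges=$(echo "$line" | LC_ALL=C awk '{ print gsub(/[A-Za-z0-9]/, "&") }')

echo "$classes $ranges"
```

For plain ASCII input the two counts agree. They can differ once the
input contains letters outside the ranges @samp{A-Z} and @samp{a-z}.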
|
|
|
|
@cindex character lists, collating elements
|
|
@cindex character lists, non-ASCII
|
|
@cindex collating elements
|
|
Two additional special sequences can appear in character lists.
|
|
These apply to non-ASCII character sets, which can have single symbols
|
|
(called @dfn{collating elements}) that are represented with more than one
|
|
character. They can also have several characters that are equivalent for
|
|
@dfn{collating}, or sorting, purposes. (For example, in French, a plain ``e''
|
|
and a grave-accented ``@`e'' are equivalent.)
|
|
These sequences are:
|
|
|
|
@table @asis
|
|
@cindex character lists, collating symbols
|
|
@cindex collating symbols
|
|
@item Collating symbols
|
|
Multicharacter collating elements enclosed between
|
|
@samp{[.} and @samp{.]}. For example, if @samp{ch} is a collating element,
|
|
then @code{[[.ch.]]} is a regexp that matches this collating element, whereas
|
|
@code{[ch]} is a regexp that matches either @samp{c} or @samp{h}.
|
|
|
|
@cindex character lists, equivalence classes
|
|
@item Equivalence classes
|
|
Locale-specific names for a list of
|
|
characters that are equivalent. The name is enclosed between
|
|
@samp{[=} and @samp{=]}.
|
|
For example, the name @samp{e} might be used to represent all of
|
|
``e,'' ``@`e,'' and ``@'e.'' In this case, @code{[[=e=]]} is a regexp
|
|
that matches any of @samp{e}, @samp{@'e}, or @samp{@`e}.
|
|
@end table
|
|
|
|
These features are very valuable in non-English-speaking locales.
|
|
|
|
@cindex internationalization, localization, character classes
|
|
@cindex @command{gawk}, character classes and
|
|
@cindex POSIX @command{awk}, character lists and, character classes
|
|
@strong{Caution:} The library functions that @command{gawk} uses for regular
|
|
expression matching currently recognize only POSIX character classes;
|
|
they do not recognize collating symbols or equivalence classes.
|
|
@c maybe one day ...
|
|
@c ENDOFRANGE charlist
|
|
|
|
@node GNU Regexp Operators
|
|
@section @command{gawk}-Specific Regexp Operators
|
|
|
|
@c This section adapted (long ago) from the regex-0.12 manual
|
|
|
|
@c STARTOFRANGE regexpg
|
|
@cindex regular expressions, operators, @command{gawk}
|
|
@c STARTOFRANGE gregexp
|
|
@cindex @command{gawk}, regular expressions, operators
|
|
@cindex operators, GNU-specific
|
|
@cindex regular expressions, operators, for words
|
|
@cindex word, regexp definition of
|
|
GNU software that deals with regular expressions provides a number of
|
|
additional regexp operators. These operators are described in this
|
|
@value{SECTION} and are specific to @command{gawk};
|
|
they are not available in other @command{awk} implementations.
|
|
Most of the additional operators deal with word matching.
|
|
For our purposes, a @dfn{word} is a sequence of one or more letters, digits,
|
|
or underscores (@samp{_}):
|
|
|
|
@table @code
|
|
@c @cindex operators, @code{\w} (@command{gawk})
|
|
@cindex backslash (@code{\}), @code{\w} operator (@command{gawk})
|
|
@cindex @code{\} (backslash), @code{\w} operator (@command{gawk})
|
|
@item \w
|
|
Matches any word-constituent character---that is, it matches any
|
|
letter, digit, or underscore. Think of it as shorthand for
|
|
@w{@code{[[:alnum:]_]}}.
|
|
|
|
@c @cindex operators, @code{\W} (@command{gawk})
|
|
@cindex backslash (@code{\}), @code{\W} operator (@command{gawk})
|
|
@cindex @code{\} (backslash), @code{\W} operator (@command{gawk})
|
|
@item \W
|
|
Matches any character that is not word-constituent.
|
|
Think of it as shorthand for
|
|
@w{@code{[^[:alnum:]_]}}.
|
|
|
|
@c @cindex operators, @code{\<} (@command{gawk})
|
|
@cindex backslash (@code{\}), @code{\<} operator (@command{gawk})
|
|
@cindex @code{\} (backslash), @code{\<} operator (@command{gawk})
|
|
@item \<
|
|
Matches the empty string at the beginning of a word.
|
|
For example, @code{/\<away/} matches @samp{away} but not
|
|
@samp{stowaway}.
|
|
|
|
@c @cindex operators, @code{\>} (@command{gawk})
|
|
@cindex backslash (@code{\}), @code{\>} operator (@command{gawk})
|
|
@cindex @code{\} (backslash), @code{\>} operator (@command{gawk})
|
|
@item \>
|
|
Matches the empty string at the end of a word.
|
|
For example, @code{/stow\>/} matches @samp{stow} but not @samp{stowaway}.
|
|
|
|
@c @cindex operators, @code{\y} (@command{gawk})
|
|
@cindex backslash (@code{\}), @code{\y} operator (@command{gawk})
|
|
@cindex @code{\} (backslash), @code{\y} operator (@command{gawk})
|
|
@c comma before using does NOT do secondary
|
|
@cindex word boundaries, matching
|
|
@item \y
|
|
Matches the empty string at either the beginning or the
|
|
end of a word (i.e., the word boundar@strong{y}). For example, @samp{\yballs?\y}
|
|
matches either @samp{ball} or @samp{balls}, as a separate word.
|
|
|
|
@c @cindex operators, @code{\B} (@command{gawk})
|
|
@cindex backslash (@code{\}), @code{\B} operator (@command{gawk})
|
|
@cindex @code{\} (backslash), @code{\B} operator (@command{gawk})
|
|
@item \B
|
|
Matches the empty string that occurs between two
|
|
word-constituent characters. For example,
|
|
@code{/\Brat\B/} matches @samp{crate} but it does not match @samp{dirty rat}.
|
|
@samp{\B} is essentially the opposite of @samp{\y}.
|
|
@end table
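
The word-matching operators can be tried out from the shell. This sketch
assumes that @command{gawk} is installed under that name, since the
operators are @command{gawk} extensions; the input words and the
variable names are invented for the illustration:

```shell
# \y matches at either edge of a word, so only "ball" and "balls"
# as separate words are selected.
words=$(printf 'ball\nballs\nfootball\nballast\n' | gawk '/\yballs?\y/')

# \B matches only between two word-constituent characters, so "rat"
# must be embedded inside a longer word.
inner=$(printf 'crate\ndirty rat\n' | gawk '/\Brat\B/')

printf 'words:\n%s\ninner:\n%s\n' "$words" "$inner"
```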
|
|
|
|
@cindex buffers, operators for
|
|
@cindex regular expressions, operators, for buffers
|
|
@cindex operators, string-matching, for buffers
|
|
There are two other operators that work on buffers. In Emacs, a
|
|
@dfn{buffer} is, naturally, an Emacs buffer. For other programs,
|
|
@command{gawk}'s regexp library routines consider the entire
|
|
string to match as the buffer.
|
|
The operators are:
|
|
|
|
@table @code
|
|
@item \`
|
|
@c @cindex operators, @code{\`} (@command{gawk})
|
|
@cindex backslash (@code{\}), @code{\`} operator (@command{gawk})
|
|
@cindex @code{\} (backslash), @code{\`} operator (@command{gawk})
|
|
Matches the empty string at the
|
|
beginning of a buffer (string).
|
|
|
|
@c @cindex operators, @code{\'} (@command{gawk})
|
|
@cindex backslash (@code{\}), @code{\'} operator (@command{gawk})
|
|
@cindex @code{\} (backslash), @code{\'} operator (@command{gawk})
|
|
@item \'
|
|
Matches the empty string at the
|
|
end of a buffer (string).
|
|
@end table
|
|
|
|
@cindex @code{^} (caret)
|
|
@cindex caret (@code{^})
|
|
@cindex @code{$} (dollar sign)
@cindex dollar sign (@code{$})
|
|
Because @samp{^} and @samp{$} always work in terms of the beginning
|
|
and end of strings, these operators don't add any new capabilities
|
|
for @command{awk}. They are provided for compatibility with other
|
|
GNU software.
|
|
|
|
@cindex @command{gawk}, word-boundary operator
|
|
@cindex word-boundary operator (@command{gawk})
|
|
@cindex operators, word-boundary (@command{gawk})
|
|
In other GNU software, the word-boundary operator is @samp{\b}. However,
|
|
that conflicts with the @command{awk} language's definition of @samp{\b}
|
|
as backspace, so @command{gawk} uses a different letter.
|
|
An alternative method would have been to require two backslashes in the
|
|
GNU operators, but this was deemed too confusing. The current
|
|
method of using @samp{\y} for the GNU @samp{\b} appears to be the
|
|
lesser of two evils.
|
|
|
|
@c NOTE!!! Keep this in sync with the same table in the summary appendix!
|
|
@c
|
|
@c Should really do this with file inclusion.
|
|
@cindex regular expressions, @command{gawk}, command-line options
|
|
@cindex @command{gawk}, command-line options
|
|
The various command-line options
|
|
(@pxref{Options})
|
|
control how @command{gawk} interprets characters in regexps:
|
|
|
|
@table @asis
|
|
@item No options
|
|
In the default case, @command{gawk} provides all the facilities of
|
|
POSIX regexps and the
|
|
@ifnotinfo
|
|
previously described
|
|
GNU regexp operators.
|
|
@end ifnotinfo
|
|
@ifnottex
|
|
GNU regexp operators described
|
|
in @ref{Regexp Operators}.
|
|
@end ifnottex
|
|
However, interval expressions are not supported.
|
|
|
|
@item @code{--posix}
|
|
Only POSIX regexps are supported; the GNU operators are not special
|
|
(e.g., @samp{\w} matches a literal @samp{w}). Interval expressions
|
|
are allowed.
|
|
|
|
@item @code{--traditional}
|
|
Traditional Unix @command{awk} regexps are matched. The GNU operators
|
|
are not special, interval expressions are not available, nor
|
|
are the POSIX character classes (@code{[[:alnum:]]}, etc.).
|
|
Characters described by octal and hexadecimal escape sequences are
|
|
treated literally, even if they represent regexp metacharacters.
|
|
|
|
@item @code{--re-interval}
|
|
Allow interval expressions in regexps, even if @option{--traditional}
|
|
has been provided. (@option{--posix} automatically enables
|
|
interval expressions, so @option{--re-interval} is redundant
|
|
when @option{--posix} is used.)
|
|
@end table
|
|
@c ENDOFRANGE gregexp
|
|
@c ENDOFRANGE regexpg
|
|
|
|
@node Case-sensitivity
|
|
@section Case Sensitivity in Matching
|
|
|
|
@c STARTOFRANGE regexpcs
|
|
@cindex regular expressions, case sensitivity
|
|
@c STARTOFRANGE csregexp
|
|
@cindex case sensitivity, regexps and
|
|
Case is normally significant in regular expressions, both when matching
|
|
ordinary characters (i.e., not metacharacters) and inside character
|
|
lists. Thus, a @samp{w} in a regular expression matches only a lowercase
|
|
@samp{w} and not an uppercase @samp{W}.
|
|
|
|
The simplest way to do a case-independent match is to use a character
|
|
list---for example, @samp{[Ww]}. However, this can be cumbersome if
|
|
you need to use it often, and it can make the regular expressions harder
|
|
to read. There are two alternatives that you might prefer.
|
|
|
|
One way to perform a case-insensitive match at a particular point in the
|
|
program is to convert the data to a single case, using the
|
|
@code{tolower} or @code{toupper} built-in string functions (which we
|
|
haven't discussed yet;
|
|
@pxref{String Functions}).
|
|
For example:
|
|
|
|
@example
|
|
tolower($1) ~ /foo/ @{ @dots{} @}
|
|
@end example
|
|
|
|
@noindent
|
|
converts the first field to lowercase before matching against it.
|
|
This works in any POSIX-compliant @command{awk}.
|
|
|
|
@cindex @command{gawk}, regular expressions, case sensitivity
|
|
@cindex case sensitivity, @command{gawk}
|
|
@cindex differences in @command{awk} and @command{gawk}, regular expressions
|
|
@cindex @code{~} (tilde), @code{~} operator
|
|
@cindex tilde (@code{~}), @code{~} operator
|
|
@cindex @code{!} (exclamation point), @code{!~} operator
|
|
@cindex exclamation point (@code{!}), @code{!~} operator
|
|
@cindex @code{IGNORECASE} variable
|
|
@c @cindex variables, @code{IGNORECASE}
|
|
Another method, specific to @command{gawk}, is to set the variable
|
|
@code{IGNORECASE} to a nonzero value (@pxref{Built-in Variables}).
|
|
When @code{IGNORECASE} is not zero, @emph{all} regexp and string
|
|
operations ignore case. Changing the value of
|
|
@code{IGNORECASE} dynamically controls the case-sensitivity of the
|
|
program as it runs. Case is significant by default because
|
|
@code{IGNORECASE} (like most variables) is initialized to zero:
|
|
|
|
@example
|
|
x = "aB"
|
|
if (x ~ /ab/) @dots{} # this test will fail
|
|
|
|
IGNORECASE = 1
|
|
if (x ~ /ab/) @dots{} # now it will succeed
|
|
@end example
|
|
|
|
In general, you cannot use @code{IGNORECASE} to make certain rules
|
|
case-insensitive and other rules case-sensitive, because there is no
|
|
straightforward way
|
|
to set @code{IGNORECASE} just for the pattern of
|
|
a particular rule.@footnote{Experienced C and C++ programmers will note
|
|
that it is possible, using something like
|
|
@samp{IGNORECASE = 1 && /foObAr/ @{ @dots{} @}}
|
|
and
|
|
@samp{IGNORECASE = 0 || /foobar/ @{ @dots{} @}}.
|
|
However, this is somewhat obscure and we don't recommend it.}
|
|
To do this, use either character lists or @code{tolower}. However, one
|
|
thing you can do only with @code{IGNORECASE} is dynamically turn
|
|
case-sensitivity on or off for all the rules at once.
|
|
|
|
@code{IGNORECASE} can be set on the command line or in a @code{BEGIN} rule
|
|
(@pxref{Other Arguments}; also
|
|
@pxref{Using BEGIN/END}).
|
|
Setting @code{IGNORECASE} from the command line is a way to make
|
|
a program case-insensitive without having to edit it.
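
One way to do this, sketched here on the assumption that
@command{gawk} is installed under that name, is a @option{-v}
assignment. The one-line program @samp{/foo/} stands in for an existing
program that was written with case-sensitive matching in mind; the input
lines and the variable names are made up:

```shell
input='Foo
bar
FOO'

# Default behavior: case is significant, so nothing matches.
strict=$(printf '%s\n' "$input" | gawk '/foo/')

# Same program text, made case-insensitive from the command line.
relaxed=$(printf '%s\n' "$input" | gawk -v IGNORECASE=1 '/foo/')

printf 'strict: [%s]\nrelaxed:\n%s\n' "$strict" "$relaxed"
```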
|
|
|
|
Prior to @command{gawk} 3.0, the value of @code{IGNORECASE}
|
|
affected regexp operations only. It did not affect string comparison
|
|
with @samp{==}, @samp{!=}, and so on.
|
|
Beginning with @value{PVERSION} 3.0, both regexp and string comparison
|
|
operations are affected by @code{IGNORECASE}.
|
|
|
|
@c @cindex ISO 8859-1
|
|
@c @cindex ISO Latin-1
|
|
Beginning with @command{gawk} 3.0,
|
|
the equivalences between upper-
|
|
and lowercase characters are based on the ISO-8859-1 (ISO Latin-1)
|
|
character set. This character set is a superset of the traditional 128
|
|
ASCII characters, which also provides a number of characters suitable
|
|
for use with European languages.
|
|
|
|
The value of @code{IGNORECASE} has no effect if @command{gawk} is in
|
|
compatibility mode (@pxref{Options}).
|
|
Case is always significant in compatibility mode.
|
|
@c ENDOFRANGE csregexp
|
|
@c ENDOFRANGE regexpcs
|
|
|
|
@node Leftmost Longest
|
|
@section How Much Text Matches?
|
|
|
|
@cindex regular expressions, leftmost longest match
|
|
@c @cindex matching, leftmost longest
|
|
Consider the following:
|
|
|
|
@example
|
|
echo aaaabcd | awk '@{ sub(/a+/, "<A>"); print @}'
|
|
@end example
|
|
|
|
This example uses the @code{sub} function (which we haven't discussed yet;
|
|
@pxref{String Functions})
|
|
to make a change to the input record. Here, the regexp @code{/a+/}
|
|
indicates ``one or more @samp{a} characters,'' and the replacement
|
|
text is @samp{<A>}.
|
|
|
|
The input contains four @samp{a} characters.
|
|
@command{awk} (and POSIX) regular expressions always match
|
|
the leftmost, @emph{longest} sequence of input characters that can
|
|
match. Thus, all four @samp{a} characters are
|
|
replaced with @samp{<A>} in this example:
|
|
|
|
@example
|
|
$ echo aaaabcd | awk '@{ sub(/a+/, "<A>"); print @}'
|
|
@print{} <A>bcd
|
|
@end example
|
|
|
|
For simple match/no-match tests, this is not so important. But when doing
|
|
text matching and substitutions with the @code{match}, @code{sub}, @code{gsub},
|
|
and @code{gensub} functions, it is very important.
|
|
@ifinfo
|
|
@xref{String Functions},
|
|
for more information on these functions.
|
|
@end ifinfo
|
|
Understanding this principle is also important for regexp-based record
|
|
and field splitting (@pxref{Records},
|
|
and also @pxref{Field Separators}).
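
The following sketch makes the rule visible in two ways. The first
command uses @code{match} and prints @code{RSTART} and @code{RLENGTH},
which show where the match starts and how long it is. The second uses a
regexp as the field separator, so that a run of colons counts as a
single separator. The input strings and the shell variable names are
assumptions made for the illustration:

```shell
# The match starts at the first "a" and extends over all four of them.
span=$(echo aaaabcd | awk '{ match($0, /a+/); print RSTART, RLENGTH }')

# With FS set to the regexp ":+", the longest run of colons is taken
# as one separator, leaving two fields.
nfields=$(echo 'a:::b' | awk -F ':+' '{ print NF }')

echo "$span / $nfields"
```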
|
|
|
|
@node Computed Regexps
|
|
@section Using Dynamic Regexps
|
|
|
|
@c STARTOFRANGE dregexp
|
|
@cindex regular expressions, computed
|
|
@c STARTOFRANGE regexpd
|
|
@cindex regular expressions, dynamic
|
|
@cindex @code{~} (tilde), @code{~} operator
|
|
@cindex tilde (@code{~}), @code{~} operator
|
|
@cindex @code{!} (exclamation point), @code{!~} operator
|
|
@cindex exclamation point (@code{!}), @code{!~} operator
|
|
@c @cindex operators, @code{~}
|
|
@c @cindex operators, @code{!~}
|
|
The righthand side of a @samp{~} or @samp{!~} operator need not be a
|
|
regexp constant (i.e., a string of characters between slashes). It may
|
|
be any expression. The expression is evaluated and converted to a string
|
|
if necessary; the contents of the string are used as the
|
|
regexp. A regexp that is computed in this way is called a @dfn{dynamic
|
|
regexp}:
|
|
|
|
@example
|
|
BEGIN @{ digits_regexp = "[[:digit:]]+" @}
|
|
$0 ~ digits_regexp @{ print @}
|
|
@end example
|
|
|
|
@noindent
|
|
This sets @code{digits_regexp} to a regexp that describes one or more digits,
|
|
and tests whether the input record matches this regexp.
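
A dynamic regexp does not have to be built inside the program. As a
sketch, the pattern can also be supplied from the shell with a
@option{-v} assignment. The variable name @code{pat}, the pattern, and
the input lines here are invented for the illustration:

```shell
# The regexp arrives as an ordinary string in the awk variable "pat"
# and is used on the righthand side of "~".
nums=$(printf '123\nabc\n45x\n' | awk -v pat='^[0-9]+$' '$0 ~ pat')

echo "$nums"
```

Keep in mind that @command{awk} processes escape sequences in
@option{-v} assignments, so a pattern containing backslashes needs the
same doubling as a string constant.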
|
|
|
|
@c @strong{Caution:} When using the @samp{~} and @samp{!~}
|
|
@strong{Caution:} When using the @samp{~} and @samp{!~}
|
|
operators, there is a difference between a regexp constant
|
|
enclosed in slashes and a string constant enclosed in double quotes.
|
|
If you are going to use a string constant, you have to understand that
|
|
the string is, in essence, scanned @emph{twice}: the first time when
|
|
@command{awk} reads your program, and the second time when it goes to
|
|
match the string on the lefthand side of the operator with the pattern
|
|
on the right. This is true of any string-valued expression (such as
|
|
@code{digits_regexp}, shown previously), not just string constants.
|
|
|
|
@cindex regexp constants, slashes vs. quotes
|
|
@cindex @code{\} (backslash), regexp constants
|
|
@cindex backslash (@code{\}), regexp constants
|
|
@cindex @code{"} (double quote), regexp constants
|
|
@cindex double quote (@code{"}), regexp constants
|
|
What difference does it make if the string is
|
|
scanned twice? The answer has to do with escape sequences, and particularly
|
|
with backslashes. To get a backslash into a regular expression inside a
|
|
string, you have to type two backslashes.
|
|
|
|
For example, @code{/\*/} is a regexp constant for a literal @samp{*}.
|
|
Only one backslash is needed. To do the same thing with a string,
|
|
you have to type @code{"\\*"}. The first backslash escapes the
|
|
second one so that the string actually contains the
|
|
two characters @samp{\} and @samp{*}.
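
The two spellings can be compared directly. In this sketch, which uses
made-up input and variable names, both commands select the line that
contains a literal @samp{*}. The programs are enclosed in single quotes
so that the shell passes the backslashes to @command{awk} untouched:

```shell
input='a*b
ab'

# Regexp constant: one backslash is enough.
const=$(printf '%s\n' "$input" | awk '$0 ~ /\*/')

# String constant: the backslash itself has to be escaped.
str=$(printf '%s\n' "$input" | awk '$0 ~ "\\*"')

printf 'const: %s\nstr:   %s\n' "$const" "$str"
```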
|
|
|
|
@cindex troubleshooting, regexp constants vs. string constants
|
|
@cindex regexp constants, vs. string constants
|
|
@cindex string constants, vs. regexp constants
|
|
Given that you can use both regexp and string constants to describe
|
|
regular expressions, which should you use? The answer is ``regexp
|
|
constants,'' for several reasons:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
String constants are more complicated to write and
|
|
more difficult to read. Using regexp constants makes your programs
|
|
less error-prone. Not understanding the difference between the two
|
|
kinds of constants is a common source of errors.
|
|
|
|
@item
|
|
It is more efficient to use regexp constants. @command{awk} can note
|
|
that you have supplied a regexp and store it internally in a form that
|
|
makes pattern matching more efficient. When using a string constant,
|
|
@command{awk} must first convert the string into this internal form and
|
|
then perform the pattern matching.
|
|
|
|
@item
|
|
Using regexp constants is better form; it shows clearly that you
|
|
intend a regexp match.
|
|
@end itemize
|
|
|
|
@c fakenode --- for prepinfo
|
|
@subheading Advanced Notes: Using @code{\n} in Character Lists of Dynamic Regexps
|
|
@cindex regular expressions, dynamic, with embedded newlines
|
|
@cindex newlines, in dynamic regexps
|
|
|
|
Some commercial versions of @command{awk} do not allow the newline
|
|
character to be used inside a character list for a dynamic regexp:
|
|
|
|
@example
|
|
$ awk '$0 ~ "[ \t\n]"'
|
|
@error{} awk: newline in character class [
|
|
@error{} ]...
|
|
@error{} source line number 1
|
|
@error{} context is
|
|
@error{} >>> <<<
|
|
@end example
|
|
|
|
@cindex newlines, in regexp constants
|
|
But a newline in a regexp constant works with no problem:
|
|
|
|
@example
|
|
$ awk '$0 ~ /[ \t\n]/'
|
|
here is a sample line
|
|
@print{} here is a sample line
|
|
@kbd{@value{CTL}-d}
|
|
@end example
|
|
|
|
@command{gawk} does not have this problem, and it isn't likely to
|
|
occur often in practice, but it's worth noting for future reference.
|
|
@c ENDOFRANGE dregexp
|
|
@c ENDOFRANGE regexpd
|
|
@c ENDOFRANGE regexp
|
|
|
|
@node Locales
|
|
@section Where You Are Makes A Difference
|
|
|
|
Modern systems support the notion of @dfn{locales}: a way to tell
|
|
the system about the local character set and language. The current
|
|
locale setting can affect the way regexp matching works, often
|
|
in surprising ways. In particular, many locales do case-insensitive
|
|
matching, even when you may have specified characters of only
|
|
one particular case.
|
|
|
|
The following example uses the @code{sub} function, which
|
|
does text replacement
|
|
(@pxref{String Functions}).
|
|
Here, the intent is to remove trailing uppercase characters:
|
|
|
|
@example
|
|
$ echo something1234abc | gawk '@{ sub("[A-Z]*$", ""); print @}'
|
|
@print{} something1234
|
|
@end example
|
|
|
|
@noindent
|
|
This output is unexpected, since the @samp{abc} at the end of @samp{something1234abc}
|
|
should not normally match @samp{[A-Z]*}. This result is due to the
|
|
locale setting (and thus you may not see it on your system).
|
|
There are two fixes. The first is to use the POSIX character
|
|
class @samp{[[:upper:]]}, instead of @samp{[A-Z]}.
|
|
The second is to change the locale setting in the environment,
|
|
before running @command{gawk},
|
|
by using the shell statements:
|
|
|
|
@example
|
|
LANG=C LC_ALL=C
|
|
export LANG LC_ALL
|
|
@end example
|
|
|
|
The setting @samp{C} forces @command{gawk} to behave in the traditional
|
|
Unix manner, where case distinctions do matter.
|
|
You may wish to put these statements into your shell startup file,
|
|
e.g., @file{$HOME/.profile}.
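
If changing the locale for a whole login session is more than you want,
the setting can also be limited to a single command. The following is
only a sketch, assuming a POSIX shell; the function name
@code{strip_trailing_upper} is made up for this illustration:

@verbatim
```shell
# Strip trailing uppercase letters from each input line.
# LC_ALL=C applies only to this one awk invocation, so [A-Z]
# means exactly the 26 ASCII uppercase letters.
strip_trailing_upper() {
    LC_ALL=C awk '{ sub(/[A-Z]*$/, ""); print }'
}

echo something1234abc | strip_trailing_upper   # lowercase tail is kept
echo something1234ABC | strip_trailing_upper   # uppercase tail is removed
```
@end verbatim

@noindent
The first command prints @samp{something1234abc} unchanged and the second
prints @samp{something1234}, whatever locale the surrounding shell uses.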
|
|
|
|
Similar considerations apply to other ranges. For example,
|
|
@samp{["-/]} is perfectly valid in ASCII, but is not valid in many
|
|
Unicode locales, such as @samp{en_US.UTF-8}. (In general, such
|
|
ranges should be avoided; either list the characters individually,
|
|
or use a POSIX character class such as @samp{[[:punct:]]}.)
|
|
|
|
For the normal case of @samp{RS = "\n"}, the locale is largely irrelevant.
|
|
For other single-byte record separators, using @samp{LC_ALL=C} will give you
|
|
much better performance when reading records. Otherwise, @command{gawk} has
|
|
to make several function calls @emph{per input character} to find the record
|
|
terminator.
|
|
|
|
@node Reading Files
|
|
@chapter Reading Input Files
|
|
|
|
@c STARTOFRANGE infir
|
|
@cindex input files, reading
|
|
@cindex input files
|
|
@cindex @code{FILENAME} variable
|
|
In the typical @command{awk} program, all input is read either from the
|
|
standard input (by default, this is the keyboard, but often it is a pipe from another
|
|
command) or from files whose names you specify on the @command{awk}
|
|
command line. If you specify input files, @command{awk} reads them
|
|
in order, processing all the data from one before going on to the next.
|
|
The name of the current input file can be found in the built-in variable
|
|
@code{FILENAME}
|
|
(@pxref{Built-in Variables}).
|
|
|
|
@cindex records
|
|
@cindex fields
|
|
The input is read in units called @dfn{records}, and is processed by the
|
|
rules of your program one record at a time.
|
|
By default, each record is one line. Each
|
|
record is automatically split into chunks called @dfn{fields}.
|
|
This makes it more convenient for programs to work on the parts of a record.
|
|
|
|
@cindex @code{getline} command
|
|
On rare occasions, you may need to use the @code{getline} command.
|
|
The @code{getline} command is valuable, both because it
|
|
can do explicit input from any number of files, and because the files
|
|
used with it do not have to be named on the @command{awk} command line
|
|
(@pxref{Getline}).
|
|
|
|
@menu
|
|
* Records:: Controlling how data is split into records.
|
|
* Fields:: An introduction to fields.
|
|
* Nonconstant Fields:: Nonconstant Field Numbers.
|
|
* Changing Fields:: Changing the Contents of a Field.
|
|
* Field Separators:: The field separator and how to change it.
|
|
* Constant Size:: Reading constant width data.
|
|
* Multiple Line:: Reading multi-line records.
|
|
* Getline:: Reading files under explicit program control
|
|
using the @code{getline} function.
|
|
@end menu
|
|
|
|
@node Records
|
|
@section How Input Is Split into Records
|
|
|
|
@c STARTOFRANGE inspl
|
|
@cindex input, splitting into records
|
|
@c STARTOFRANGE recspl
|
|
@cindex records, splitting input into
|
|
@cindex @code{NR} variable
|
|
@cindex @code{FNR} variable
|
|
The @command{awk} utility divides the input for your @command{awk}
|
|
program into records and fields.
|
|
@command{awk} keeps track of the number of records that have
|
|
been read
|
|
so far
|
|
from the current input file. This value is stored in a
|
|
built-in variable called @code{FNR}. It is reset to zero when a new
|
|
file is started. Another built-in variable, @code{NR}, is the total
|
|
number of input records read so far from all @value{DF}s. It starts at zero,
|
|
but is never automatically reset to zero.
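
The difference between the two counters is easiest to see with more than
one input file. This sketch assumes a POSIX shell with @command{mktemp};
the file names @file{one} and @file{two} and the function name
@code{show_counts} are invented for the demonstration:

@verbatim
```shell
# Print the running total (NR) next to the per-file count (FNR).
show_counts() {
    awk '{ print FILENAME ": NR=" NR ", FNR=" FNR }' "$@"
}

# Two small throwaway input files, created only for this example.
tmpdir=$(mktemp -d)
printf 'a\nb\n'    > "$tmpdir/one"
printf 'c\nd\ne\n' > "$tmpdir/two"

show_counts "$tmpdir/one" "$tmpdir/two"
rm -rf "$tmpdir"
```
@end verbatim

@noindent
@code{NR} keeps counting from one through five across both files, while
@code{FNR} starts over at one when the second file begins.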
|
|
|
|
@cindex separators, for records
|
|
@cindex record separators
|
|
Records are separated by a character called the @dfn{record separator}.
|
|
By default, the record separator is the newline character.
|
|
This is why records are, by default, single lines.
|
|
A different character can be used for the record separator by
|
|
assigning the character to the built-in variable @code{RS}.
|
|
|
|
@cindex newlines, as record separators
|
|
@cindex @code{RS} variable
|
|
Like any other variable,
|
|
the value of @code{RS} can be changed in the @command{awk} program
|
|
with the assignment operator, @samp{=}
|
|
(@pxref{Assignment Ops}).
|
|
The new record-separator character should be enclosed in quotation marks,
|
|
which indicate a string constant. Often the right time to do this is
|
|
at the beginning of execution, before any input is processed,
|
|
so that the very first record is read with the proper separator.
|
|
To do this, use the special @code{BEGIN} pattern
|
|
(@pxref{BEGIN/END}).
|
|
For example:
|
|
|
|
@cindex @code{BEGIN} pattern
|
|
@example
|
|
awk 'BEGIN @{ RS = "/" @}
|
|
@{ print $0 @}' BBS-list
|
|
@end example
|
|
|
|
@noindent
|
|
changes the value of @code{RS} to @code{"/"}, before reading any input.
|
|
This is a string whose first character is a slash; as a result, records
|
|
are separated by slashes. Then the input file is read, and the second
|
|
rule in the @command{awk} program (the action with no pattern) prints each
|
|
record. Because each @code{print} statement adds a newline at the end of
|
|
its output, this @command{awk} program copies the input
|
|
with each slash changed to a newline. Here are the results of running
|
|
the program on @file{BBS-list}:
|
|
|
|
@example
|
|
$ awk 'BEGIN @{ RS = "/" @}
|
|
> @{ print $0 @}' BBS-list
|
|
@print{} aardvark 555-5553 1200
|
|
@print{} 300 B
|
|
@print{} alpo-net 555-3412 2400
|
|
@print{} 1200
|
|
@print{} 300 A
|
|
@print{} barfly 555-7685 1200
|
|
@print{} 300 A
|
|
@print{} bites 555-1675 2400
|
|
@print{} 1200
|
|
@print{} 300 A
|
|
@print{} camelot 555-0542 300 C
|
|
@print{} core 555-2912 1200
|
|
@print{} 300 C
|
|
@print{} fooey 555-1234 2400
|
|
@print{} 1200
|
|
@print{} 300 B
|
|
@print{} foot 555-6699 1200
|
|
@print{} 300 B
|
|
@print{} macfoo 555-6480 1200
|
|
@print{} 300 A
|
|
@print{} sdace 555-3430 2400
|
|
@print{} 1200
|
|
@print{} 300 A
|
|
@print{} sabafoo 555-2127 1200
|
|
@print{} 300 C
|
|
@print{}
|
|
@end example
|
|
|
|
@noindent
|
|
Note that the entry for the @samp{camelot} BBS is not split.
|
|
In the original @value{DF}
|
|
(@pxref{Sample Data Files}),
|
|
the line looks like this:
|
|
|
|
@example
|
|
camelot 555-0542 300 C
|
|
@end example
|
|
|
|
@noindent
|
|
It has one baud rate only, so there are no slashes in the record,
|
|
unlike the others which have two or more baud rates.
|
|
In fact, this record is treated as part of the record
|
|
for the @samp{core} BBS; the newline separating them in the output
|
|
is the original newline in the @value{DF}, not the one added by
|
|
@command{awk} when it printed the record!
|
|
|
|
@cindex record separators, changing
|
|
@cindex separators, for records
|
|
Another way to change the record separator is on the command line,
|
|
using the variable-assignment feature
|
|
(@pxref{Other Arguments}):
|
|
|
|
@example
|
|
awk '@{ print $0 @}' RS="/" BBS-list
|
|
@end example
|
|
|
|
@noindent
|
|
This sets @code{RS} to @samp{/} before processing @file{BBS-list}.
|
|
|
|
Using an unusual character such as @samp{/} for the record separator
|
|
produces correct behavior in the vast majority of cases. However,
|
|
the following (extreme) pipeline prints a surprising @samp{1}:
|
|
|
|
@example
|
|
$ echo | awk 'BEGIN @{ RS = "a" @} ; @{ print NF @}'
|
|
@print{} 1
|
|
@end example
|
|
|
|
There is one field, consisting of a newline. The value of the built-in
|
|
variable @code{NF} is the number of fields in the current record.
|
|
|
|
@cindex dark corner, input files
|
|
Reaching the end of an input file terminates the current input record,
|
|
even if the last character in the file is not the character in @code{RS}.
|
|
@value{DARKCORNER}
|
|
|
|
@cindex null strings
|
|
@cindex strings, empty, See null strings
|
|
The empty string @code{""} (a string without any characters)
|
|
has a special meaning
|
|
as the value of @code{RS}. It means that records are separated
|
|
by one or more blank lines and nothing else.
|
|
@xref{Multiple Line}, for more details.
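
As a small illustration of the empty-string case, the following sketch
treats each blank-line-separated paragraph as one record. It assumes a
POSIX shell; the function name @code{first_words} is hypothetical:

@verbatim
```shell
# RS = "" selects paragraph mode: runs of blank lines separate records.
# Print each record number with the first field of that record.
first_words() {
    awk 'BEGIN { RS = "" } { print NR ": " $1 }'
}

printf 'alpha beta\ngamma\n\n\ndelta\nepsilon\n' | first_words
```
@end verbatim

@noindent
The input holds two paragraphs separated by two blank lines, so the
program prints @samp{1: alpha} and @samp{2: delta}.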
|
|
|
|
If you change the value of @code{RS} in the middle of an @command{awk} run,
|
|
the new value is used to delimit subsequent records, but the record
|
|
currently being processed, as well as records already processed, are not
|
|
affected.
|
|
|
|
@cindex @code{RT} variable
|
|
@cindex records, terminating
|
|
@cindex terminating records
|
|
@cindex differences in @command{awk} and @command{gawk}, record separators
|
|
@cindex regular expressions, as record separators
|
|
@cindex record separators, regular expressions as
|
|
@cindex separators, for records, regular expressions as
|
|
After the end of the record has been determined, @command{gawk}
|
|
sets the variable @code{RT} to the text in the input that matched
|
|
@code{RS}.
|
|
When using @command{gawk},
|
|
the value of @code{RS} is not limited to a one-character
|
|
string. It can be any regular expression
|
|
(@pxref{Regexp}).
|
|
In general, each record
|
|
ends at the next string that matches the regular expression; the next
|
|
record starts at the end of the matching string. This general rule is
|
|
actually at work in the usual case, where @code{RS} contains just a
|
|
newline: a record ends at the beginning of the next matching string (the
|
|
next newline in the input), and the following record starts just after
|
|
the end of this string (at the first character of the following line).
|
|
The newline, because it matches @code{RS}, is not part of either record.
|
|
|
|
When @code{RS} is a single character, @code{RT}
|
|
contains the same single character. However, when @code{RS} is a
|
|
regular expression, @code{RT} contains
|
|
the actual input text that matched the regular expression.
|
|
|
|
The following example illustrates both of these features.
|
|
It sets @code{RS} equal to a regular expression that
|
|
matches either a newline or a series of one or more uppercase letters
|
|
with optional leading and/or trailing whitespace:
|
|
|
|
@example
|
|
$ echo record 1 AAAA record 2 BBBB record 3 |
|
|
> gawk 'BEGIN @{ RS = "\n|( *[[:upper:]]+ *)" @}
|
|
> @{ print "Record =", $0, "and RT =", RT @}'
|
|
@print{} Record = record 1 and RT = AAAA
|
|
@print{} Record = record 2 and RT = BBBB
|
|
@print{} Record = record 3 and RT =
|
|
@print{}
|
|
@end example
|
|
|
|
@noindent
|
|
The output ends with an extra blank line. This is because the
|
|
value of @code{RT} is a newline, and the @code{print} statement
|
|
supplies its own terminating newline.
|
|
@xref{Simple Sed}, for a more useful example
|
|
of @code{RS} as a regexp and @code{RT}.
|
|
|
|
If you set @code{RS} to a regular expression that allows optional
|
|
trailing text, such as @samp{RS = "abc(XYZ)?"}, it is possible, due
|
|
to implementation constraints, that @command{gawk} may match the leading
|
|
part of the regular expression, but not the trailing part, particularly
|
|
if the input text that could match the trailing part is fairly long.
|
|
@command{gawk} attempts to avoid this problem, but currently, there's
|
|
no guarantee that this will never happen.
|
|
|
|
@cindex differences in @command{awk} and @command{gawk}, @code{RS}/@code{RT} variables
|
|
The use of @code{RS} as a regular expression and the @code{RT}
|
|
variable are @command{gawk} extensions; they are not available in
|
|
compatibility mode
|
|
(@pxref{Options}).
|
|
In compatibility mode, only the first character of the value of
|
|
@code{RS} is used to determine the end of the record.
|
|
|
|
@c fakenode --- for prepinfo
|
|
@subheading Advanced Notes: @code{RS = "\0"} Is Not Portable
|
|
|
|
@cindex advanced features, @value{DF}s as single record
|
|
@cindex portability, @value{DF}s as single record
|
|
There are times when you might want to treat an entire @value{DF} as a
|
|
single record. The only way to make this happen is to give @code{RS}
|
|
a value that you know doesn't occur in the input file. This is hard
|
|
to do in a general way, such that a program always works for arbitrary
|
|
input files.
|
|
@c can you say `understatement' boys and girls?
|
|
|
|
You might think that for text files, the @sc{nul} character, a
character with all bits equal to zero, is a good
|
|
value to use for @code{RS} in this case:
|
|
|
|
@example
|
|
BEGIN @{ RS = "\0" @} # whole file becomes one record?
|
|
@end example
|
|
|
|
@cindex differences in @command{awk} and @command{gawk}, strings, storing
|
|
@command{gawk} in fact accepts this, and uses the @sc{nul}
|
|
character for the record separator.
|
|
However, this usage is @emph{not} portable
|
|
to other @command{awk} implementations.
|
|
|
|
@cindex dark corner, strings, storing
|
|
All other @command{awk} implementations@footnote{At least that we know
|
|
about.} store strings internally as C-style strings. C strings use the
|
|
@sc{nul} character as the string terminator. In effect, this means that
|
|
@samp{RS = "\0"} is the same as @samp{RS = ""}.
|
|
@value{DARKCORNER}
|
|
|
|
@cindex records, treating files as
|
|
@cindex files, as single records
|
|
The best way to treat a whole file as a single record is to
|
|
simply read the file in, one record at a time, concatenating each
|
|
record onto the end of the previous ones.
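
One way to do this, shown here only as a sketch, is to append each record
to a variable and work on the result in an @code{END} rule. The variable
name @code{whole} and the function name @code{slurp_length} are arbitrary
choices for this example, which assumes a POSIX shell:

@verbatim
```shell
# Collect every record of the input into one string, then use it once
# all input has been read. Here the "use" is simply printing its length.
slurp_length() {
    awk '{ whole = whole $0 "\n" }      # put back the newline awk removed
         END { print length(whole) }'
}

printf 'one\ntwo\nthree\n' | slurp_length
```
@end verbatim

@noindent
For this input the program prints @samp{14}. Note that the collected
string always ends with a newline, even if the last line of the file
did not.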
|
|
@c ENDOFRANGE inspl
|
|
@c ENDOFRANGE recspl
|
|
|
|
@node Fields
|
|
@section Examining Fields
|
|
|
|
@cindex examining fields
|
|
@cindex fields
|
|
@cindex accessing fields
|
|
@c STARTOFRANGE fiex
|
|
@cindex fields, examining
|
|
@cindex POSIX @command{awk}, field separators and
|
|
@cindex field separators, POSIX and
|
|
@cindex separators, field, POSIX and
|
|
When @command{awk} reads an input record, the record is
|
|
automatically @dfn{parsed} or separated by the interpreter into chunks
|
|
called @dfn{fields}. By default, fields are separated by @dfn{whitespace},
|
|
like words in a line.
|
|
Whitespace in @command{awk} means any string of one or more spaces,
|
|
tabs, or newlines;@footnote{In POSIX @command{awk}, newlines are not
|
|
considered whitespace for separating fields.} other characters, such as
|
|
formfeed, vertical tab, etc.@: that are
|
|
considered whitespace by other languages, are @emph{not} considered
|
|
whitespace by @command{awk}.
|
|
|
|
The purpose of fields is to make it more convenient for you to refer to
|
|
these pieces of the record. You don't have to use them---you can
|
|
operate on the whole record if you want---but fields are what make
|
|
simple @command{awk} programs so powerful.
|
|
|
|
@cindex @code{$} field operator
|
|
@cindex field operator @code{$}
|
|
@cindex @code{$} (dollar sign), @code{$} field operator
|
|
@cindex dollar sign (@code{$}), @code{$} field operator
|
|
@c The comma here does NOT mark a secondary term:
|
|
@cindex field operators, dollar sign as
|
|
To refer to a field in an @command{awk} program, use a dollar sign
(@samp{$}) followed by the number of the field you want. Thus, @code{$1}
|
|
refers to the first field, @code{$2} to the second, and so on.
|
|
(Unlike the Unix shells, the field numbers are not limited to single digits.
|
|
@code{$127} is the one hundred twenty-seventh field in the record.)
|
|
For example, suppose the following is a line of input:
|
|
|
|
@example
|
|
This seems like a pretty nice example.
|
|
@end example
|
|
|
|
@noindent
|
|
Here the first field, or @code{$1}, is @samp{This}, the second field, or
|
|
@code{$2}, is @samp{seems}, and so on. Note that the last field,
|
|
@code{$7}, is @samp{example.}. Because there is no space between the
|
|
@samp{e} and the @samp{.}, the period is considered part of the seventh
|
|
field.
|
|
|
|
@cindex @code{NF} variable
|
|
@cindex fields, number of
|
|
@code{NF} is a built-in variable whose value is the number of fields
|
|
in the current record. @command{awk} automatically updates the value
|
|
of @code{NF} each time it reads a record. No matter how many fields
|
|
there are, the last field in a record can be represented by @code{$NF}.
|
|
So, @code{$NF} is the same as @code{$7}, which is @samp{example.}.
|
|
If you try to reference a field beyond the last
|
|
one (such as @code{$8} when the record has only seven fields), you get
|
|
the empty string. (If used in a numeric operation, you get zero.)
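
The following sketch applies these rules to the sample line used earlier.
It assumes a POSIX shell, and the function name @code{describe_fields}
is made up for the illustration:

@verbatim
```shell
# Print the field count, the last field, a field past the end shown
# between brackets, and that same field used as a number.
describe_fields() {
    awk '{ print NF, $NF, "[" $(NF+1) "]", $(NF+1) + 0 }'
}

echo 'This seems like a pretty nice example.' | describe_fields
```
@end verbatim

@noindent
The output is @samp{7 example. [] 0}: seven fields, the last of which is
@samp{example.}, followed by an empty string and its numeric value, zero.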
|
|
|
|
The use of @code{$0}, which looks like a reference to the ``zero-th'' field, is
|
|
a special case: it represents the whole input record
|
|
when you are not interested in specific fields.
|
|
Here are some more examples:
|
|
|
|
@example
|
|
$ awk '$1 ~ /foo/ @{ print $0 @}' BBS-list
|
|
@print{} fooey 555-1234 2400/1200/300 B
|
|
@print{} foot 555-6699 1200/300 B
|
|
@print{} macfoo 555-6480 1200/300 A
|
|
@print{} sabafoo 555-2127 1200/300 C
|
|
@end example
|
|
|
|
@noindent
|
|
This example prints each record in the file @file{BBS-list} whose first
|
|
field contains the string @samp{foo}. The operator @samp{~} is called a
|
|
@dfn{matching operator}
|
|
(@pxref{Regexp Usage});
|
|
it tests whether a string (here, the field @code{$1}) matches a given regular
|
|
expression.
|
|
|
|
By contrast, the following example
|
|
looks for @samp{foo} in @emph{the entire record} and prints the first
|
|
field and the last field for each matching input record:
|
|
|
|
@example
|
|
$ awk '/foo/ @{ print $1, $NF @}' BBS-list
|
|
@print{} fooey B
|
|
@print{} foot B
|
|
@print{} macfoo A
|
|
@print{} sabafoo C
|
|
@end example
|
|
@c ENDOFRANGE fiex
|
|
|
|
@node Nonconstant Fields
|
|
@section Nonconstant Field Numbers
|
|
@cindex fields, numbers
|
|
@cindex field numbers
|
|
|
|
The number of a field does not need to be a constant. Any expression in
|
|
the @command{awk} language can be used after a @samp{$} to refer to a
|
|
field. The value of the expression specifies the field number. If the
|
|
value is a string, rather than a number, it is converted to a number.
|
|
Consider this example:
|
|
|
|
@example
|
|
awk '@{ print $NR @}'
|
|
@end example
|
|
|
|
@noindent
|
|
Recall that @code{NR} is the number of records read so far: one in the
|
|
first record, two in the second, etc. So this example prints the first
|
|
field of the first record, the second field of the second record, and so
|
|
on. For the twentieth record, field number 20 is printed; most likely,
|
|
the record has fewer than 20 fields, so this prints a blank line.
|
|
Here is another example of using expressions as field numbers:
|
|
|
|
@example
|
|
awk '@{ print $(2*2) @}' BBS-list
|
|
@end example
|
|
|
|
@command{awk} evaluates the expression @samp{(2*2)} and uses
|
|
its value as the number of the field to print. The @samp{*} sign
|
|
represents multiplication, so the expression @samp{2*2} evaluates to four.
|
|
The parentheses are used so that the multiplication is done before the
|
|
@samp{$} operation; they are necessary whenever there is a binary
|
|
operator in the field-number expression. This example, then, prints the
|
|
hours of operation (the fourth field) for every line of the file
|
|
@file{BBS-list}. (All of the @command{awk} operators are listed, in
|
|
order of decreasing precedence, in
|
|
@ref{Precedence}.)
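
The effect of the parentheses can be seen by placing the two spellings
side by side. This is a minimal sketch, assuming a POSIX shell; the
function name @code{compare_precedence} is hypothetical:

@verbatim
```shell
# $(NF-1) selects the next-to-last field; $NF-1 subtracts one from the
# value of the last field, because $ binds more tightly than -.
compare_precedence() {
    awk '{ print $(NF-1), $NF-1 }'
}

echo '10 20 30' | compare_precedence
```
@end verbatim

@noindent
This prints @samp{20 29}: the second field, then the third field minus one.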
|
|
|
|
If the field number you compute is zero, you get the entire record.
|
|
Thus, @samp{$(2-2)} has the same value as @code{$0}. Negative field
|
|
numbers are not allowed; trying to reference one usually terminates
|
|
the program. (The POSIX standard does not define
|
|
what happens when you reference a negative field number. @command{gawk}
|
|
notices this and terminates your program. Other @command{awk}
|
|
implementations may behave differently.)
|
|
|
|
As mentioned in @ref{Fields},
|
|
@command{awk} stores the current record's number of fields in the built-in
|
|
variable @code{NF} (also @pxref{Built-in Variables}). The expression
|
|
@code{$NF} is not a special feature---it is the direct consequence of
|
|
evaluating @code{NF} and using its value as a field number.
|
|
|
|
@node Changing Fields
|
|
@section Changing the Contents of a Field
|
|
|
|
@c STARTOFRANGE ficon
|
|
@cindex fields, changing contents of
|
|
The contents of a field, as seen by @command{awk}, can be changed within an
|
|
@command{awk} program; this changes what @command{awk} perceives as the
|
|
current input record. (The actual input is untouched; @command{awk} @emph{never}
|
|
modifies the input file.)
|
|
Consider the following example and its output:
|
|
|
|
@example
|
|
$ awk '@{ nboxes = $3 ; $3 = $3 - 10
|
|
> print nboxes, $3 @}' inventory-shipped
|
|
@print{} 25 15
|
|
@print{} 32 22
|
|
@print{} 24 14
|
|
@dots{}
|
|
@end example
|
|
|
|
@noindent
|
|
The program first saves the original value of field three in the variable
|
|
@code{nboxes}.
|
|
The @samp{-} sign represents subtraction, so this program reassigns
|
|
field three, @code{$3}, as the original value of field three minus ten:
|
|
@samp{$3 - 10}. (@xref{Arithmetic Ops}.)
|
|
Then it prints the original and new values for field three.
|
|
(Someone in the warehouse made a consistent mistake while inventorying
|
|
the red boxes.)
|
|
|
|
For this to work, the text in field @code{$3} must make sense
|
|
as a number; the string of characters must be converted to a number
|
|
for the computer to do arithmetic on it. The number resulting
|
|
from the subtraction is converted back to a string of characters that
|
|
then becomes field three.
|
|
@xref{Conversion}.
|
|
|
|
When the value of a field is changed (as perceived by @command{awk}), the
|
|
text of the input record is recalculated to contain the new field where
|
|
the old one was. In other words, @code{$0} changes to reflect the altered
|
|
field. Thus, this program
|
|
prints a copy of the input file, with 10 subtracted from the second
|
|
field of each line:
|
|
|
|
@example
|
|
$ awk '@{ $2 = $2 - 10; print $0 @}' inventory-shipped
|
|
@print{} Jan 3 25 15 115
|
|
@print{} Feb 5 32 24 226
|
|
@print{} Mar 5 24 34 228
|
|
@dots{}
|
|
@end example
|
|
|
|
It is also possible to assign contents to fields that are out
|
|
of range. For example:
|
|
|
|
@example
|
|
$ awk '@{ $6 = ($5 + $4 + $3 + $2)
|
|
> print $6 @}' inventory-shipped
|
|
@print{} 168
|
|
@print{} 297
|
|
@print{} 301
|
|
@dots{}
|
|
@end example
|
|
|
|
@cindex adding, fields
|
|
@cindex fields, adding
|
|
@noindent
|
|
We've just created @code{$6}, whose value is the sum of fields
|
|
@code{$2}, @code{$3}, @code{$4}, and @code{$5}. The @samp{+} sign
|
|
represents addition. For the file @file{inventory-shipped}, @code{$6}
|
|
represents the total number of parcels shipped for a particular month.
|
|
|
|
Creating a new field changes @command{awk}'s internal copy of the current
|
|
input record, which is the value of @code{$0}. Thus, if you do @samp{print $0}
|
|
after adding a field, the record printed includes the new field, with
|
|
the appropriate number of field separators between it and the previously
|
|
existing fields.
|
|
|
|
@cindex @code{OFS} variable
|
|
@cindex output field separator, See @code{OFS} variable
|
|
@cindex field separators, See Also @code{OFS}
|
|
This recomputation affects and is affected by
|
|
@code{NF} (the number of fields; @pxref{Fields}).
|
|
For example, the value of @code{NF} is set to the number of the highest
|
|
field you create.
|
|
The exact format of @code{$0} is also affected by a feature that has not been discussed yet:
|
|
the @dfn{output field separator}, @code{OFS},
|
|
used to separate the fields (@pxref{Output Separators}).
|
|
|
|
Note, however, that merely @emph{referencing} an out-of-range field
|
|
does @emph{not} change the value of either @code{$0} or @code{NF}.
|
|
Referencing an out-of-range field only produces an empty string. For
|
|
example:
|
|
|
|
@example
|
|
if ($(NF+1) != "")
|
|
print "can't happen"
|
|
else
|
|
print "everything is normal"
|
|
@end example
|
|
|
|
@noindent
|
|
should print @samp{everything is normal}, because @code{NF+1} is certain
|
|
to be out of range. (@xref{If Statement},
|
|
for more information about @command{awk}'s @code{if-else} statements.
|
|
@xref{Typing and Comparison},
|
|
for more information about the @samp{!=} operator.)
|
|
|
|
It is important to note that making an assignment to an existing field
|
|
changes the
|
|
value of @code{$0} but does not change the value of @code{NF},
|
|
even when you assign the empty string to a field. For example:
|
|
|
|
@example
|
|
$ echo a b c d | awk '@{ OFS = ":"; $2 = ""
|
|
> print $0; print NF @}'
|
|
@print{} a::c:d
|
|
@print{} 4
|
|
@end example
|
|
|
|
@noindent
|
|
The field is still there; it just has an empty value, denoted by
|
|
the two colons between @samp{a} and @samp{c}.
|
|
This example shows what happens if you create a new field:
|
|
|
|
@example
|
|
$ echo a b c d | awk '@{ OFS = ":"; $2 = ""; $6 = "new"
|
|
> print $0; print NF @}'
|
|
@print{} a::c:d::new
|
|
@print{} 6
|
|
@end example
|
|
|
|
@noindent
|
|
The intervening field, @code{$5}, is created with an empty value
|
|
(indicated by the second pair of adjacent colons),
|
|
and @code{NF} is updated with the value six.
|
|
|
|
@c FIXME: Verify that this is in POSIX
|
|
@cindex dark corner, @code{NF} variable, decrementing
|
|
@cindex @code{NF} variable, decrementing
|
|
Decrementing @code{NF} throws away the values of the fields
|
|
after the new value of @code{NF} and recomputes @code{$0}.
|
|
@value{DARKCORNER}
|
|
Here is an example:
|
|
|
|
@example
|
|
$ echo a b c d e f | awk '@{ print "NF =", NF;
|
|
> NF = 3; print $0 @}'
|
|
@print{} NF = 6
|
|
@print{} a b c
|
|
@end example
|
|
|
|
@c the comma before decrementing does NOT represent a tertiary entry
|
|
@cindex portability, @code{NF} variable, decrementing
|
|
@strong{Caution:} Some versions of @command{awk} don't
|
|
rebuild @code{$0} when @code{NF} is decremented. Caveat emptor.
|
|
|
|
Finally, there are times when it is convenient to force
|
|
@command{awk} to rebuild the entire record, using the current
|
|
value of the fields and @code{OFS}. To do this, use the
|
|
seemingly innocuous assignment:
|
|
|
|
@example
|
|
$1 = $1 # force record to be reconstituted
|
|
print $0 # or whatever else with $0
|
|
@end example
|
|
|
|
@noindent
|
|
This forces @command{awk} to rebuild the record. It does help
|
|
to add a comment, as we've shown here.
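
A common use of this idiom is changing the separator of every line. The
sketch below assumes a POSIX shell; the function name @code{to_colons}
is invented for the example:

@verbatim
```shell
# Convert whitespace-separated input to colon-separated output.
# $1 = $1 leaves the field values alone, but it makes awk rebuild $0
# from the fields, joined with the current OFS.
to_colons() {
    awk 'BEGIN { OFS = ":" } { $1 = $1; print $0 }'
}

echo '  a   b c  ' | to_colons
```
@end verbatim

@noindent
The result is @samp{a:b:c}. Without the assignment, @code{$0} would be
printed exactly as it was read, extra spaces included.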
|
|
|
|
There is a flip side to the relationship between @code{$0} and
|
|
the fields. Any assignment to @code{$0} causes the record to be
|
|
reparsed into fields using the @emph{current} value of @code{FS}.
|
|
This also applies to any built-in function that updates @code{$0},
|
|
such as @code{sub} and @code{gsub}
|
|
(@pxref{String Functions}).
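
The following sketch shows the reparsing at work. It assumes a POSIX
shell; the names @code{reparse_demo} and @code{before} are made up for
the illustration:

@verbatim
```shell
# Assigning to $0 splits the new text with whatever FS holds right now.
reparse_demo() {
    awk '{ before = NF        # fields found by the default whitespace split
           FS = ":"           # does not touch the fields already split
           $0 = "x:y:z"       # new record text is split using the new FS
           print before, NF, $2 }'
}

echo 'a b' | reparse_demo
```
@end verbatim

@noindent
This prints @samp{2 3 y}: two fields in the record as read, then three
fields after the assignment, the second of which is @samp{y}.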
|
|
@c ENDOFRANGE ficon
|
|
|
|
@node Field Separators
|
|
@section Specifying How Fields Are Separated
|
|
|
|
@menu
|
|
* Regexp Field Splitting:: Using regexps as the field separator.
|
|
* Single Character Fields:: Making each character a separate field.
|
|
* Command Line Field Separator:: Setting @code{FS} from the command line.
|
|
* Field Splitting Summary:: Some final points and a summary table.
|
|
@end menu
|
|
|
|
@cindex @code{FS} variable
|
|
@cindex fields, separating
|
|
@c STARTOFRANGE fisepr
|
|
@cindex field separators
|
|
@c STARTOFRANGE fisepg
|
|
@cindex fields, separating
|
|
The @dfn{field separator}, which is either a single character or a regular
|
|
expression, controls the way @command{awk} splits an input record into fields.
|
|
@command{awk} scans the input record for character sequences that
|
|
match the separator; the fields themselves are the text between the matches.
|
|
|
|
In the examples that follow, we use the bullet symbol (@bullet{}) to
|
|
represent spaces in the output.
|
|
If the field separator is @samp{oo}, then the following line:
|
|
|
|
@example
|
|
moo goo gai pan
|
|
@end example
|
|
|
|
@noindent
|
|
is split into three fields: @samp{m}, @samp{@bullet{}g}, and
|
|
@samp{@bullet{}gai@bullet{}pan}.
|
|
Note the leading spaces in the values of the second and third fields.
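
The split described here can be reproduced with a short program. This is
only a sketch, assuming a POSIX shell; the function name
@code{split_on_oo} is hypothetical, and square brackets are used in the
output to make the leading spaces visible:

@verbatim
```shell
# Split each line on the two-character separator "oo" and print every
# field on its own line, wrapped in brackets.
split_on_oo() {
    awk 'BEGIN { FS = "oo" }
         { for (i = 1; i <= NF; i++) print i ": [" $i "]" }'
}

echo 'moo goo gai pan' | split_on_oo
```
@end verbatim

@noindent
The three output lines are @samp{1: [m]}, @samp{2: [ g]}, and
@samp{3: [ gai pan]}.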
|
|
|
|
@cindex troubleshooting, @command{awk} uses @code{FS} not @code{IFS}
|
|
The field separator is represented by the built-in variable @code{FS}.
|
|
Shell programmers take note: @command{awk} does @emph{not} use the
|
|
name @code{IFS} that is used by the POSIX-compliant shells (such as
|
|
the Unix Bourne shell, @command{sh}, or @command{bash}).
|
|
|
|
@cindex @code{FS} variable, changing value of
|
|
The value of @code{FS} can be changed in the @command{awk} program with the
|
|
assignment operator, @samp{=} (@pxref{Assignment Ops}).
|
|
Often the right time to do this is at the beginning of execution
|
|
before any input has been processed, so that the very first record
|
|
is read with the proper separator. To do this, use the special
|
|
@code{BEGIN} pattern
|
|
(@pxref{BEGIN/END}).
|
|
For example, here we set the value of @code{FS} to the string
|
|
@code{","}:
|
|
|
|
@example
|
|
awk 'BEGIN @{ FS = "," @} ; @{ print $2 @}'
|
|
@end example
|
|
|
|
@cindex @code{BEGIN} pattern
|
|
@noindent
|
|
Given the input line:
|
|
|
|
@example
|
|
John Q. Smith, 29 Oak St., Walamazoo, MI 42139
|
|
@end example
|
|
|
|
@noindent
|
|
this @command{awk} program extracts and prints the string
|
|
@samp{@bullet{}29@bullet{}Oak@bullet{}St.}.
|
|
|
|
@cindex field separators, choice of
|
|
@cindex regular expressions as field separators
|
|
@cindex field separators, regular expressions as
|
|
Sometimes the input data contains separator characters that don't
|
|
separate fields the way you thought they would. For instance, the
|
|
person's name in the example we just used might have a title or
|
|
suffix attached, such as:
|
|
|
|
@example
|
|
John Q. Smith, LXIX, 29 Oak St., Walamazoo, MI 42139
|
|
@end example
|
|
|
|
@noindent
|
|
The same program would extract @samp{@bullet{}LXIX}, instead of
|
|
@samp{@bullet{}29@bullet{}Oak@bullet{}St.}.
|
|
If you were expecting the program to print the
|
|
address, you would be surprised. The moral is to choose your data layout and
|
|
separator characters carefully to prevent such problems.
|
|
(If the data is not in a form that is easy to process, perhaps you
|
|
can massage it first with a separate @command{awk} program.)
|
|
|
|
@cindex newlines, as field separators
|
|
@cindex whitespace, as field separators
|
|
Fields are normally separated by whitespace sequences
|
|
(spaces, tabs, and newlines), not by single spaces. Two spaces in a row do not
|
|
delimit an empty field. The default value of the field separator @code{FS}
|
|
is a string containing a single space, @w{@code{" "}}. If @command{awk}
|
|
interpreted this value in the usual way, each space character would separate
|
|
fields, so two spaces in a row would make an empty field between them.
|
|
The reason this does not happen is that a single space as the value of
|
|
@code{FS} is a special case---it is taken to specify the default manner
|
|
of delimiting fields.
|
|
|
|
If @code{FS} is any other single character, such as @code{","}, then
|
|
each occurrence of that character separates two fields. Two consecutive
|
|
occurrences delimit an empty field. If the character occurs at the
|
|
beginning or the end of the line, that too delimits an empty field. The
|
|
space character is the only single character that does not follow these
|
|
rules.
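
These rules can be checked by counting fields. The sketch below assumes a
POSIX shell, and the function name @code{count_comma_fields} is made up
for the example:

@verbatim
```shell
# Count the fields on each line when a single comma is the separator.
count_comma_fields() {
    awk 'BEGIN { FS = "," } { print NF }'
}

echo ',a,,b,' | count_comma_fields
```
@end verbatim

@noindent
The answer is @samp{5}: an empty field before the first comma, @samp{a},
an empty field between the two adjacent commas, @samp{b}, and an empty
field after the final comma.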
|
|
|
|
@node Regexp Field Splitting
|
|
@subsection Using Regular Expressions to Separate Fields
|
|
|
|
@c STARTOFRANGE regexpfs
|
|
@cindex regular expressions, as field separators
|
|
@c STARTOFRANGE fsregexp
|
|
@cindex field separators, regular expressions as
|
|
The previous @value{SUBSECTION}
|
|
discussed the use of single characters or simple strings as the
|
|
value of @code{FS}.
|
|
More generally, the value of @code{FS} may be a string containing any
|
|
regular expression. In this case, each match in the record for the regular
|
|
expression separates fields. For example, the assignment:
|
|
|
|
@example
|
|
FS = ", \t"
|
|
@end example
|
|
|
|
@noindent
|
|
makes every area of an input line that consists of a comma followed by a
|
|
space and a TAB into a field separator.
|
|
@ifinfo
|
|
(@samp{\t}
|
|
is an @dfn{escape sequence} that stands for a TAB;
|
|
@pxref{Escape Sequences},
|
|
for the complete list of similar escape sequences.)
|
|
@end ifinfo
|
|
|
|
For a less trivial example of a regular expression, try using
|
|
single spaces to separate fields the way single commas are used.
|
|
@code{FS} can be set to @w{@code{"[@ ]"}} (left bracket, space, right
|
|
bracket). This regular expression matches a single space and nothing else
|
|
(@pxref{Regexp}).
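
The following sketch, with made-up input containing two spaces between
@samp{a} and @samp{b}, shows the effect. With the default @code{FS} the
two spaces form one separator, but with @code{"[ ]"} each space is a
separator and an empty field appears between them:

@example
$ echo 'a  b' | awk '@{ print NF @}'
@print{} 2
$ echo 'a  b' | awk 'BEGIN @{ FS = "[ ]" @} @{ print NF @}'
@print{} 3
@end example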
|
|
|
|
There is an important difference between the two cases of @samp{FS = @w{" "}}
|
|
(a single space) and @samp{FS = @w{"[ \t\n]+"}}
|
|
(a regular expression matching one or more spaces, tabs, or newlines).
|
|
For both values of @code{FS}, fields are separated by @dfn{runs}
|
|
(multiple adjacent occurrences) of spaces, tabs,
|
|
and/or newlines. However, when the value of @code{FS} is @w{@code{" "}},
|
|
@command{awk} first strips leading and trailing whitespace from
|
|
the record and then decides where the fields are.
|
|
For example, the following pipeline prints @samp{b}:
|
|
|
|
@example
|
|
$ echo ' a b c d ' | awk '@{ print $2 @}'
|
|
@print{} b
|
|
@end example
|
|
|
|
@noindent
|
|
However, this pipeline prints @samp{a} (note the extra spaces around
|
|
each letter):
|
|
|
|
@example
|
|
$ echo ' a  b  c  d ' | awk 'BEGIN @{ FS = "[ \t\n]+" @}
|
|
> @{ print $2 @}'
|
|
@print{} a
|
|
@end example
|
|
|
|
@noindent
|
|
@cindex null strings
|
|
@cindex strings, null
|
|
@cindex empty strings, See null strings
|
|
In this case, the first field is @dfn{null} or empty.
|
|
|
|
The stripping of leading and trailing whitespace also comes into
|
|
play whenever @code{$0} is recomputed. For instance, study this pipeline:
|
|
|
|
@example
|
|
$ echo ' a b c d' | awk '@{ print; $2 = $2; print @}'
|
|
@print{}  a b c d
|
|
@print{} a b c d
|
|
@end example
|
|
|
|
@noindent
|
|
The first @code{print} statement prints the record as it was read,
|
|
with leading whitespace intact. The assignment to @code{$2} rebuilds
|
|
@code{$0} by concatenating @code{$1} through @code{$NF} together,
|
|
separated by the value of @code{OFS}. Because the leading whitespace
|
|
was ignored when finding @code{$1}, it is not part of the new @code{$0}.
|
|
Finally, the last @code{print} statement prints the new @code{$0}.
|
|
@c ENDOFRANGE regexpfs
|
|
@c ENDOFRANGE fsregexp
|
|
|
|
@node Single Character Fields
|
|
@subsection Making Each Character a Separate Field
|
|
|
|
@cindex differences in @command{awk} and @command{gawk}, single-character fields
|
|
@cindex single-character fields
|
|
@cindex fields, single-character
|
|
There are times when you may want to examine each character
|
|
of a record separately. This can be done in @command{gawk} by
|
|
simply assigning the null string (@code{""}) to @code{FS}. In this case,
|
|
each individual character in the record becomes a separate field.
|
|
For example:
|
|
|
|
@example
|
|
$ echo a b | gawk 'BEGIN @{ FS = "" @}
|
|
> @{
|
|
> for (i = 1; i <= NF; i = i + 1)
|
|
> print "Field", i, "is", $i
|
|
> @}'
|
|
@print{} Field 1 is a
|
|
@print{} Field 2 is
|
|
@print{} Field 3 is b
|
|
@end example
|
|
|
|
@cindex dark corner, @code{FS} as null string
|
|
@cindex FS variable, as null string
|
|
Traditionally, the behavior of @code{FS} equal to @code{""} was not defined.
|
|
In this case, most versions of Unix @command{awk} simply treat the entire record
|
|
as only having one field.
|
|
@value{DARKCORNER}
|
|
In compatibility mode
|
|
(@pxref{Options}),
|
|
if @code{FS} is the null string, then @command{gawk} also
|
|
behaves this way.
|
|
|
|
@node Command Line Field Separator
|
|
@subsection Setting @code{FS} from the Command Line
|
|
@cindex @code{-F} option
|
|
@cindex options, command-line
|
|
@cindex command line, options
|
|
@cindex field separators, on command line
|
|
@c The comma before "setting" does NOT represent a tertiary
|
|
@cindex command line, @code{FS} on, setting
|
|
@cindex @code{FS} variable, setting from command line
|
|
|
|
@code{FS} can be set on the command line. Use the @option{-F} option to
|
|
do so. For example:
|
|
|
|
@example
|
|
awk -F, '@var{program}' @var{input-files}
|
|
@end example
|
|
|
|
@noindent
|
|
sets @code{FS} to the @samp{,} character. Notice that the option uses
|
|
an uppercase @samp{F} instead of a lowercase @samp{f}. The latter
|
|
option (@option{-f}) specifies a file
|
|
containing an @command{awk} program. Case is significant in command-line
|
|
options:
|
|
the @option{-F} and @option{-f} options have nothing to do with each other.
|
|
You can use both options at the same time to set the @code{FS} variable
|
|
@emph{and} get an @command{awk} program from a file.
|
|
|
|
The value used for the argument to @option{-F} is processed in exactly the
|
|
same way as assignments to the built-in variable @code{FS}.
|
|
Any special characters in the field separator must be escaped
|
|
appropriately. For example, to use a @samp{\} as the field separator
|
|
on the command line, you would have to type:
|
|
|
|
@example
|
|
# same as FS = "\\"
|
|
awk -F\\\\ '@dots{}' files @dots{}
|
|
@end example
|
|
|
|
@noindent
|
|
@cindex @code{\} (backslash), as field separators
|
|
@cindex backslash (@code{\}), as field separators
|
|
Because @samp{\} is used for quoting in the shell, @command{awk} sees
|
|
@samp{-F\\}. Then @command{awk} processes the @samp{\\} for escape
|
|
characters (@pxref{Escape Sequences}), finally yielding
|
|
a single @samp{\} to use for the field separator.
|
|
|
|
@c @cindex historical features
|
|
As a special case, in compatibility mode
|
|
(@pxref{Options}),
|
|
if the argument to @option{-F} is @samp{t}, then @code{FS} is set to
|
|
the TAB character. If you type @samp{-F\t} at the
|
|
shell, without any quotes, the @samp{\} gets deleted, so @command{awk}
|
|
figures that you really want your fields to be separated with tabs and
|
|
not @samp{t}s. Use @samp{-v FS="t"} or @samp{-F"[t]"} on the command line
|
|
if you really do want to separate your fields with @samp{t}s.
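
Here is a short sketch of the bracketed form, using made-up input.
Because @samp{[t]} is a regexp rather than the bare letter, the special
case for @samp{t} should not come into play:

@example
$ echo 'atbtc' | awk -F'[t]' '@{ print $2 @}'
@print{} b
@end example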
|
|
|
|
For example, let's use an @command{awk} program file called @file{baud.awk}
|
|
that contains the pattern @code{/300/} and the action @samp{print $1}:
|
|
|
|
@example
|
|
/300/ @{ print $1 @}
|
|
@end example
|
|
|
|
Let's also set @code{FS} to be the @samp{-} character and run the
|
|
program on the file @file{BBS-list}. The following command prints a
|
|
list of the names of the bulletin boards that operate at 300 baud and
|
|
the first three digits of their phone numbers:
|
|
|
|
@c tweaked to make the tex output look better in @smallbook
|
|
@example
|
|
$ awk -F- -f baud.awk BBS-list
|
|
@print{} aardvark 555
|
|
@print{} alpo
|
|
@print{} barfly 555
|
|
@print{} bites 555
|
|
@print{} camelot 555
|
|
@print{} core 555
|
|
@print{} fooey 555
|
|
@print{} foot 555
|
|
@print{} macfoo 555
|
|
@print{} sdace 555
|
|
@print{} sabafoo 555
|
|
@end example
|
|
|
|
@noindent
|
|
Note the second line of output. The second line
|
|
in the original file looked like this:
|
|
|
|
@example
|
|
alpo-net 555-3412 2400/1200/300 A
|
|
@end example
|
|
|
|
The @samp{-} as part of the system's name was used as the field
|
|
separator, instead of the @samp{-} in the phone number that was
|
|
originally intended. This demonstrates why you have to be careful in
|
|
choosing your field and record separators.
|
|
|
|
@c The comma after "password files" does NOT start a tertiary
|
|
@cindex Unix @command{awk}, password files, field separators and
|
|
Perhaps the most common use of a single character as the field
|
|
separator occurs when processing the Unix system password file.
|
|
On many Unix systems, each user has a separate entry in the system password
|
|
file, one line per user. The information in these lines is separated
|
|
by colons. The first field is the user's logon name and the second is
|
|
the user's (encrypted or shadow) password. A password file entry might look
|
|
like this:
|
|
|
|
@cindex Robbins, Arnold
|
|
@example
|
|
arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/bash
|
|
@end example
|
|
|
|
The following program searches the system password file and prints
|
|
the entries for users who have no password:
|
|
|
|
@example
|
|
awk -F: '$2 == ""' /etc/passwd
|
|
@end example
|
|
|
|
@node Field Splitting Summary
|
|
@subsection Field-Splitting Summary
|
|
|
|
It is important to remember that when you assign a string constant
|
|
as the value of @code{FS}, it undergoes normal @command{awk} string
|
|
processing. For example, with Unix @command{awk} and @command{gawk},
|
|
the assignment @samp{FS = "\.."} assigns the character string @code{".."}
|
|
to @code{FS} (the backslash is stripped). This creates a regexp meaning
|
|
``fields are separated by occurrences of any two characters.''
|
|
If instead you want fields to be separated by a literal period followed
|
|
by any single character, use @samp{FS = "\\.."}.
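
As an illustration with made-up input, the doubled backslash below
reaches the regexp matcher as @samp{\.}, so each separator is a period
plus the one character that follows it:

@example
$ echo 'a.xb.yc' | awk 'BEGIN @{ FS = "\\.." @} @{ print $1, $2, $3 @}'
@print{} a b c
@end example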
|
|
|
|
The following table summarizes how fields are split, based on the value
|
|
of @code{FS} (@samp{==} means ``is equal to''):
|
|
|
|
@table @code
|
|
@item FS == " "
|
|
Fields are separated by runs of whitespace. Leading and trailing
|
|
whitespace are ignored. This is the default.
|
|
|
|
@item FS == @var{any other single character}
|
|
Fields are separated by each occurrence of the character. Multiple
|
|
successive occurrences delimit empty fields, as do leading and
|
|
trailing occurrences.
|
|
The character can even be a regexp metacharacter; it does not need
|
|
to be escaped.
|
|
|
|
@item FS == @var{regexp}
|
|
Fields are separated by occurrences of characters that match @var{regexp}.
|
|
Leading and trailing matches of @var{regexp} delimit empty fields.
|
|
|
|
@item FS == ""
|
|
Each individual character in the record becomes a separate field.
|
|
(This is a @command{gawk} extension; it is not specified by the
|
|
POSIX standard.)
|
|
@end table
|
|
|
|
@c fakenode --- for prepinfo
|
|
@subheading Advanced Notes: Changing @code{FS} Does Not Affect the Fields
|
|
|
|
@cindex POSIX @command{awk}, field separators and
|
|
@cindex field separators, POSIX and
|
|
According to the POSIX standard, @command{awk} is supposed to behave
|
|
as if each record is split into fields at the time it is read.
|
|
In particular, this means that if you change the value of @code{FS}
|
|
after a record is read, the value of the fields (i.e., how they were split)
|
|
should reflect the old value of @code{FS}, not the new one.
|
|
|
|
@cindex dark corner, field separators
|
|
@cindex @command{sed} utility
|
|
@cindex stream editors
|
|
However, many implementations of @command{awk} do not work this way. Instead,
|
|
they defer splitting the fields until a field is actually
|
|
referenced. The fields are split
|
|
using the @emph{current} value of @code{FS}!
|
|
@value{DARKCORNER}
|
|
This behavior can be difficult
|
|
to diagnose. The following example illustrates the difference
|
|
between the two methods.
|
|
(The @command{sed}@footnote{The @command{sed} utility is a ``stream editor.''
|
|
Its behavior is also defined by the POSIX standard.}
|
|
command prints just the first line of @file{/etc/passwd}.)
|
|
|
|
@example
|
|
sed 1q /etc/passwd | awk '@{ FS = ":" ; print $1 @}'
|
|
@end example
|
|
|
|
@noindent
|
|
which usually prints:
|
|
|
|
@example
|
|
root
|
|
@end example
|
|
|
|
@noindent
|
|
on an incorrect implementation of @command{awk}, while @command{gawk}
|
|
prints something like:
|
|
|
|
@example
|
|
root:nSijPlPhZZwgE:0:0:Root:/:
|
|
@end example
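
If what you want is for the first record to be split on colons, one way
to sidestep the difference is to set @code{FS} before any input is read,
either in a @code{BEGIN} rule or with @option{-F}. This sketch uses a
made-up line instead of the real password file:

@example
$ echo 'root:x:0:0' | awk 'BEGIN @{ FS = ":" @} @{ print $1 @}'
@print{} root
@end example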
|
|
|
|
@c fakenode --- for prepinfo
|
|
@subheading Advanced Notes: @code{FS} and @code{IGNORECASE}
|
|
|
|
The @code{IGNORECASE} variable
|
|
(@pxref{User-modified})
|
|
affects field splitting @emph{only} when the value of @code{FS} is a regexp.
|
|
It has no effect when @code{FS} is a single character, even if
|
|
that character is a letter. Thus, in the following code:
|
|
|
|
@example
|
|
FS = "c"
|
|
IGNORECASE = 1
|
|
$0 = "aCa"
|
|
print $1
|
|
@end example
|
|
|
|
@noindent
|
|
The output is @samp{aCa}. If you really want to split fields on an
|
|
alphabetic character while ignoring case, use a regexp that will
|
|
do it for you, such as @samp{FS = "[c]"}. In this case, @code{IGNORECASE}
|
|
will take effect.
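
The regexp form can be tried from the shell as follows. This is only a
sketch: it requires @command{gawk}, and it sets @code{IGNORECASE} before
@code{FS} simply to keep the order of events obvious:

@example
$ echo aCa | gawk 'BEGIN @{ IGNORECASE = 1; FS = "[c]" @} @{ print $1 @}'
@print{} a
@end example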
|
|
|
|
@c ENDOFRANGE fisepr
|
|
@c ENDOFRANGE fisepg
|
|
|
|
@node Constant Size
|
|
@section Reading Fixed-Width Data
|
|
|
|
@ifnotinfo
|
|
@strong{Note:} This @value{SECTION} discusses an advanced
|
|
feature of @command{gawk}. If you are a novice @command{awk} user,
|
|
you might want to skip it on the first reading.
|
|
@end ifnotinfo
|
|
|
|
@ifinfo
|
|
(This @value{SECTION} discusses an advanced feature of @command{gawk}.
|
|
If you are a novice @command{awk} user, you might want to skip it on
|
|
the first reading.)
|
|
@end ifinfo
|
|
|
|
@cindex data, fixed-width
|
|
@cindex fixed-width data
|
|
@cindex advanced features, fixed-width data
|
|
@command{gawk} @value{PVERSION} 2.13 introduced a facility for dealing with
|
|
fixed-width fields with no distinctive field separator. For example,
|
|
data of this nature arises in the input for old Fortran programs where
|
|
numbers are run together, or in the output of programs that did not
|
|
anticipate the use of their output as input for other programs.
|
|
|
|
An example of the latter is a table where all the columns are lined up by
|
|
the use of a variable number of spaces and @emph{empty fields are just
|
|
spaces}. Clearly, @command{awk}'s normal field splitting based on @code{FS}
|
|
does not work well in this case. Although a portable @command{awk} program
|
|
can use a series of @code{substr} calls on @code{$0}
|
|
(@pxref{String Functions}),
|
|
this is awkward and inefficient for a large number of fields.
|
|
|
|
@c comma before specifying is part of tertiary
|
|
@cindex troubleshooting, fatal errors, field widths, specifying
|
|
@cindex @command{w} utility
|
|
@cindex @code{FIELDWIDTHS} variable
|
|
The splitting of an input record into fixed-width fields is specified by
|
|
assigning a string containing space-separated numbers to the built-in
|
|
variable @code{FIELDWIDTHS}. Each number specifies the width of the field,
|
|
@emph{including} columns between fields. If you want to ignore the columns
|
|
between fields, you can specify the width as a separate field that is
|
|
subsequently ignored.
|
|
It is a fatal error to supply a field width that is not a positive number.
|
|
The following data is the output of the Unix @command{w} utility. It is useful
|
|
to illustrate the use of @code{FIELDWIDTHS}:
|
|
|
|
@example
|
|
@group
|
|
 10:06pm  up 21 days, 14:04,  23 users
User     tty       login@  idle   JCPU   PCPU  what
hzuo     ttyV0     8:58pm            9      5  vi p24.tex
hzang    ttyV3     6:37pm    50                -csh
eklye    ttyV5     9:53pm            7      1  em thes.tex
dportein ttyV6     8:17pm  1:47                -csh
gierd    ttyD3    10:00pm     1                elm
dave     ttyD4     9:47pm            4      4  w
brent    ttyp0    26Jun91  4:46  26:46   4:41  bash
dave     ttyq4    26Jun9115days     46     46  wnewmail
|
|
@end group
|
|
@end example
|
|
|
|
The following program takes the above input, converts the idle time to
|
|
number of seconds, and prints out the first two fields and the calculated
|
|
idle time:
|
|
|
|
@strong{Note:}
|
|
This program uses a number of @command{awk} features that
|
|
haven't been introduced yet.
|
|
|
|
@example
|
|
BEGIN @{ FIELDWIDTHS = "9 6 10 6 7 7 35" @}
|
|
NR > 2 @{
|
|
idle = $4
|
|
sub(/^ */, "", idle) # strip leading spaces
|
|
if (idle == "")
|
|
idle = 0
|
|
if (idle ~ /:/) @{
|
|
split(idle, t, ":")
|
|
idle = t[1] * 60 + t[2]
|
|
@}
|
|
if (idle ~ /days/)
|
|
idle *= 24 * 60 * 60
|
|
|
|
print $1, $2, idle
|
|
@}
|
|
@end example
|
|
|
|
Running the program on the data produces the following results:
|
|
|
|
@example
|
|
hzuo ttyV0 0
|
|
hzang ttyV3 50
|
|
eklye ttyV5 0
|
|
dportein ttyV6 107
|
|
gierd ttyD3 1
|
|
dave ttyD4 0
|
|
brent ttyp0 286
|
|
dave ttyq4 1296000
|
|
@end example
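
For a smaller sketch of the same mechanism, the made-up record below has
no separators at all, and the widths @samp{3 2 5} are chosen only for
illustration:

@example
$ echo 'abcdefghij' | gawk 'BEGIN @{ FIELDWIDTHS = "3 2 5" @}
>     @{ print $1; print $2; print $3 @}'
@print{} abc
@print{} de
@print{} fghij
@end example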
|
|
|
|
Another (possibly more practical) example of fixed-width input data
|
|
is the input from a deck of balloting cards. In some parts of
|
|
the United States, voters mark their choices by punching holes in computer
|
|
cards. These cards are then processed to count the votes for any particular
|
|
candidate or on any particular issue. Because a voter may choose not to
|
|
vote on some issue, any column on the card may be empty. An @command{awk}
|
|
program for processing such data could use the @code{FIELDWIDTHS} feature
|
|
to simplify reading the data. (Of course, getting @command{gawk} to run on
|
|
a system with card readers is another story!)
|
|
|
|
@ignore
|
|
Exercise: Write a ballot card reading program
|
|
@end ignore
|
|
|
|
@cindex @command{gawk}, splitting fields and
|
|
Assigning a value to @code{FS} causes @command{gawk} to use
|
|
@code{FS} for field splitting again. Use @samp{FS = FS} to make this happen,
|
|
without having to know the current value of @code{FS}.
|
|
In order to tell which kind of field splitting is in effect,
|
|
use @code{PROCINFO["FS"]}
|
|
(@pxref{Auto-set}).
|
|
The value is @code{"FS"} if regular field splitting is being used,
|
|
or it is @code{"FIELDWIDTHS"} if fixed-width field splitting is being used:
|
|
|
|
@example
|
|
if (PROCINFO["FS"] == "FS")
|
|
@var{regular field splitting} @dots{}
|
|
else
|
|
@var{fixed-width field splitting} @dots{}
|
|
@end example
|
|
|
|
This information is useful when writing a function
|
|
that needs to temporarily change @code{FS} or @code{FIELDWIDTHS},
|
|
read some records, and then restore the original settings
|
|
(@pxref{Passwd Functions},
|
|
for an example of such a function).
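
The following sketch, which requires @command{gawk} and uses a made-up
record, shows @code{PROCINFO["FS"]} changing when @samp{FS = FS} switches
back to regular field splitting. The assignment @samp{$0 = $0} is there
only to make @command{gawk} split the current record again:

@example
$ echo 'abcdef' | gawk 'BEGIN @{ FIELDWIDTHS = "2 4" @}
> @{
>     print PROCINFO["FS"], $1
>     FS = FS       # return to FS-based splitting
>     $0 = $0       # resplit the current record
>     print PROCINFO["FS"], $1
> @}'
@print{} FIELDWIDTHS ab
@print{} FS abcdef
@end example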
|
|
|
|
@node Multiple Line
|
|
@section Multiple-Line Records
|
|
|
|
@c STARTOFRANGE recm
|
|
@cindex records, multiline
|
|
@c STARTOFRANGE imr
|
|
@cindex input, multiline records
|
|
@c STARTOFRANGE frm
|
|
@cindex files, reading, multiline records
|
|
@cindex input, files, See input files
|
|
In some databases, a single line cannot conveniently hold all the
|
|
information in one entry. In such cases, you can use multiline
|
|
records. The first step in doing this is to choose your data format.
|
|
|
|
@cindex record separators, with multiline records
|
|
One technique is to use an unusual character or string to separate
|
|
records. For example, you could use the formfeed character (written
|
|
@samp{\f} in @command{awk}, as in C) to separate them, making each record
|
|
a page of the file. To do this, just set the variable @code{RS} to
|
|
@code{"\f"} (a string containing the formfeed character). Any
|
|
other character could equally well be used, as long as it won't be part
|
|
of the data in a record.
|
|
|
|
@cindex @code{RS} variable, multiline records and
|
|
Another technique is to have blank lines separate records. By a special
|
|
dispensation, an empty string as the value of @code{RS} indicates that
|
|
records are separated by one or more blank lines. When @code{RS} is set
|
|
to the empty string, each record always ends at the first blank line
|
|
encountered. The next record doesn't start until the first nonblank
|
|
line that follows. No matter how many blank lines appear in a row, they
|
|
all act as one record separator.
|
|
(Blank lines must be completely empty; lines that contain only
|
|
whitespace do not count.)
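
As a quick sketch with made-up input, the three blank lines between
@samp{b} and @samp{c} below act as a single separator, so only two
records are counted:

@example
$ printf 'a\nb\n\n\n\nc\n' | awk 'BEGIN @{ RS = "" @} END @{ print NR @}'
@print{} 2
@end example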
|
|
|
|
@cindex leftmost longest match
|
|
@cindex matching, leftmost longest
|
|
You can achieve the same effect as @samp{RS = ""} by assigning the
|
|
string @code{"\n\n+"} to @code{RS}. This regexp matches the newline
|
|
at the end of the record and one or more blank lines after the record.
|
|
In addition, a regular expression always matches the longest possible
|
|
sequence when there is a choice
|
|
(@pxref{Leftmost Longest}).
|
|
So the next record doesn't start until
|
|
the first nonblank line that follows---no matter how many blank lines
|
|
appear in a row, they are considered one record separator.
|
|
|
|
@cindex dark corner, multiline records
|
|
There is an important difference between @samp{RS = ""} and
|
|
@samp{RS = "\n\n+"}. In the first case, leading newlines in the input
|
|
@value{DF} are ignored, and if a file ends without extra blank lines
|
|
after the last record, the final newline is removed from the record.
|
|
In the second case, this special processing is not done.
|
|
@value{DARKCORNER}
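
The difference can be seen with a sketch like the following, run with
@command{gawk} on made-up input that starts with two newlines. With
@samp{RS = ""} the leading newlines are skipped; with the regexp they
delimit an empty first record, so one more record is counted:

@example
$ printf '\n\na\n\nb\n' | gawk 'BEGIN @{ RS = "" @} END @{ print NR @}'
@print{} 2
$ printf '\n\na\n\nb\n' | gawk 'BEGIN @{ RS = "\n\n+" @} END @{ print NR @}'
@print{} 3
@end example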
|
|
|
|
@cindex field separators, in multiline records
|
|
Now that the input is separated into records, the second step is to
|
|
separate the fields in the record. One way to do this is to divide each
|
|
of the lines into fields in the normal manner. This happens by default
|
|
as the result of a special feature. When @code{RS} is set to the empty
|
|
string, @emph{and} @code{FS} is set to a single character,
|
|
the newline character @emph{always} acts as a field separator.
|
|
This is in addition to whatever field separations result from
|
|
@code{FS}.@footnote{When @code{FS} is the null string (@code{""})
|
|
or a regexp, this special feature of @code{RS} does not apply.
|
|
It does apply to the default field separator of a single space:
|
|
@samp{FS = " "}.}
|
|
|
|
The original motivation for this special exception was probably to provide
|
|
useful behavior in the default case (i.e., @code{FS} is equal
|
|
to @w{@code{" "}}). This feature can be a problem if you really don't
|
|
want the newline character to separate fields, because there is no way to
|
|
prevent it. However, you can work around this by using the @code{split}
|
|
function to break up the record manually
|
|
(@pxref{String Functions}).
|
|
If you have a single-character field separator, you can work around
|
|
the special feature in a different way, by making @code{FS} into a
|
|
regexp for that single character. For example, if the field
|
|
separator is a percent character, instead of
|
|
@samp{FS = "%"}, use @samp{FS = "[%]"}.
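
The following sketch assumes @command{gawk} running in its default
(non-POSIX) mode; other @command{awk} versions may treat the regexp case
differently. The made-up record consists of the two lines @samp{a%b} and
@samp{c%d}:

@example
$ printf '%s\n' 'a%b' 'c%d' |
> gawk 'BEGIN @{ RS = ""; FS = "%" @} @{ print NF @}'
@print{} 4
$ printf '%s\n' 'a%b' 'c%d' |
> gawk 'BEGIN @{ RS = ""; FS = "[%]" @} @{ print NF @}'
@print{} 3
@end example

@noindent
In the first case the newline separates @samp{b} from @samp{c}; in the
second it stays inside the middle field.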
|
|
|
|
Another way to separate fields is to
|
|
put each field on a separate line: to do this, just set the
|
|
variable @code{FS} to the string @code{"\n"}. (This single
|
|
character separator matches a single newline.)
|
|
A practical example of a @value{DF} organized this way might be a mailing
|
|
list, where each entry is separated by blank lines. Consider a mailing
|
|
list in a file named @file{addresses}, which looks like this:
|
|
|
|
@example
|
|
Jane Doe
|
|
123 Main Street
|
|
Anywhere, SE 12345-6789
|
|
|
|
John Smith
|
|
456 Tree-lined Avenue
|
|
Smallville, MW 98765-4321
|
|
@dots{}
|
|
@end example
|
|
|
|
@noindent
|
|
A simple program to process this file is as follows:
|
|
|
|
@example
|
|
# addrs.awk --- simple mailing list program
|
|
|
|
# Records are separated by blank lines.
|
|
# Each line is one field.
|
|
BEGIN @{ RS = "" ; FS = "\n" @}
|
|
|
|
@{
|
|
print "Name is:", $1
|
|
print "Address is:", $2
|
|
print "City and State are:", $3
|
|
print ""
|
|
@}
|
|
@end example
|
|
|
|
Running the program produces the following output:
|
|
|
|
@example
|
|
$ awk -f addrs.awk addresses
|
|
@print{} Name is: Jane Doe
|
|
@print{} Address is: 123 Main Street
|
|
@print{} City and State are: Anywhere, SE 12345-6789
|
|
@print{}
|
|
@print{} Name is: John Smith
|
|
@print{} Address is: 456 Tree-lined Avenue
|
|
@print{} City and State are: Smallville, MW 98765-4321
|
|
@print{}
|
|
@dots{}
|
|
@end example
|
|
|
|
@xref{Labels Program}, for a more realistic
|
|
program that deals with address lists.
|
|
The following
|
|
table
|
|
summarizes how records are split, based on the
|
|
value of
|
|
@ifinfo
|
|
@code{RS}.
|
|
(@samp{==} means ``is equal to.'')
|
|
@end ifinfo
|
|
@ifnotinfo
|
|
@code{RS}:
|
|
@end ifnotinfo
|
|
|
|
@table @code
|
|
@item RS == "\n"
|
|
Records are separated by the newline character (@samp{\n}). In effect,
|
|
every line in the @value{DF} is a separate record, including blank lines.
|
|
This is the default.
|
|
|
|
@item RS == @var{any single character}
|
|
Records are separated by each occurrence of the character. Multiple
|
|
successive occurrences delimit empty records.
|
|
|
|
@item RS == ""
|
|
Records are separated by runs of blank lines. The newline character
|
|
always serves as a field separator, in addition to whatever value
|
|
@code{FS} may have. Leading and trailing newlines in a file are ignored.
|
|
|
|
@item RS == @var{regexp}
|
|
Records are separated by occurrences of characters that match @var{regexp}.
|
|
Leading and trailing matches of @var{regexp} delimit empty records.
|
|
(This is a @command{gawk} extension; it is not specified by the
|
|
POSIX standard.)
|
|
@end table
|
|
|
|
@cindex @code{RT} variable
|
|
In all cases, @command{gawk} sets @code{RT} to the input text that matched the
|
|
value specified by @code{RS}.
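
As a sketch (requiring @command{gawk}, with made-up input), the following
pipeline splits records at runs of digits and prints each record next to
the text that terminated it. The last record has no terminator, so
@code{RT} is empty there:

@example
$ printf 'a1b22c' | gawk 'BEGIN @{ RS = "[0-9]+" @}
>     @{ print "[" $0 "] [" RT "]" @}'
@print{} [a] [1]
@print{} [b] [22]
@print{} [c] []
@end example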
|
|
@c ENDOFRANGE recm
|
|
@c ENDOFRANGE imr
|
|
@c ENDOFRANGE frm
|
|
|
|
@node Getline
|
|
@section Explicit Input with @code{getline}
|
|
|
|
@c STARTOFRANGE getl
|
|
@cindex @code{getline} command, explicit input with
|
|
@cindex input, explicit
|
|
So far we have been getting our input data from @command{awk}'s main
|
|
input stream---either the standard input (usually your terminal, sometimes
|
|
the output from another program) or from the
|
|
files specified on the command line. The @command{awk} language has a
|
|
special built-in command called @code{getline} that
|
|
can be used to read input under your explicit control.
|
|
|
|
The @code{getline} command is used in several different ways and should
|
|
@emph{not} be used by beginners.
|
|
The examples that follow the explanation of the @code{getline} command
|
|
include material that has not been covered yet. Therefore, come back
|
|
and study the @code{getline} command @emph{after} you have reviewed the
|
|
rest of this @value{DOCUMENT} and have a good knowledge of how @command{awk} works.
|
|
|
|
@cindex @code{ERRNO} variable
|
|
@cindex differences in @command{awk} and @command{gawk}, @code{getline} command
|
|
@cindex @code{getline} command, return values
|
|
The @code{getline} command returns one if it finds a record and zero if
|
|
it encounters the end of the file. If there is some error in getting
|
|
a record, such as a file that cannot be opened, then @code{getline}
|
|
returns @minus{}1. In this case, @command{gawk} sets the variable
|
|
@code{ERRNO} to a string describing the error that occurred.
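
A minimal sketch of checking the return value follows. The @value{FN}
@file{/no/such/file} is hypothetical and is assumed not to exist on your
system:

@example
$ awk 'BEGIN @{
>     r = (getline line < "/no/such/file")
>     print r
> @}'
@print{} -1
@end example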
|
|
|
|
In the following examples, @var{command} stands for a string value that
|
|
represents a shell command.
|
|
|
|
@menu
|
|
* Plain Getline:: Using @code{getline} with no arguments.
|
|
* Getline/Variable:: Using @code{getline} into a variable.
|
|
* Getline/File:: Using @code{getline} from a file.
|
|
* Getline/Variable/File:: Using @code{getline} into a variable from a
|
|
file.
|
|
* Getline/Pipe:: Using @code{getline} from a pipe.
|
|
* Getline/Variable/Pipe:: Using @code{getline} into a variable from a
|
|
pipe.
|
|
* Getline/Coprocess:: Using @code{getline} from a coprocess.
|
|
* Getline/Variable/Coprocess:: Using @code{getline} into a variable from a
|
|
coprocess.
|
|
* Getline Notes:: Important things to know about @code{getline}.
|
|
* Getline Summary:: Summary of @code{getline} Variants.
|
|
@end menu
|
|
|
|
@node Plain Getline
|
|
@subsection Using @code{getline} with No Arguments
|
|
|
|
The @code{getline} command can be used without arguments to read input
|
|
from the current input file. All it does in this case is read the next
|
|
input record and split it up into fields. This is useful if you've
|
|
finished processing the current record, but want to do some special
|
|
processing on the next record @emph{right now}. For example:
|
|
|
|
@example
|
|
@{
|
|
if ((t = index($0, "/*")) != 0) @{
|
|
# value of `tmp' will be "" if t is 1
|
|
tmp = substr($0, 1, t - 1)
|
|
u = index(substr($0, t + 2), "*/")
|
|
while (u == 0) @{
|
|
if (getline <= 0) @{
|
|
m = "unexpected EOF or error"
|
|
m = (m ": " ERRNO)
|
|
print m > "/dev/stderr"
|
|
exit
|
|
@}
|
|
t = -1
|
|
u = index($0, "*/")
|
|
@}
|
|
# substr expression will be "" if */
|
|
# occurred at end of line
|
|
$0 = tmp substr($0, t + u + 3)
|
|
@}
|
|
print $0
|
|
@}
|
|
@end example
|
|
|
|
This @command{awk} program deletes all C-style comments (@samp{/* @dots{}
|
|
*/}) from the input. By replacing the @samp{print $0} with other
|
|
statements, you could perform more complicated processing on the
|
|
decommented input, such as searching for matches of a regular
|
|
expression. (This program has a subtle problem---it does not work if one
|
|
comment ends and another begins on the same line.)
|
|
|
|
@ignore
|
|
Exercise,
|
|
write a program that does handle multiple comments on the line.
|
|
@end ignore
|
|
|
|
This form of the @code{getline} command sets @code{NF},
|
|
@code{NR}, @code{FNR}, and the value of @code{$0}.
|
|
|
|
@strong{Note:} The new value of @code{$0} is used to test
|
|
the patterns of any subsequent rules. The original value
|
|
of @code{$0} that triggered the rule that executed @code{getline}
|
|
is lost.
|
|
By contrast, the @code{next} statement reads a new record
|
|
but immediately begins processing it normally, starting with the first
|
|
rule in the program. @xref{Next Statement}.
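
The following sketch, with made-up input, shows that after a plain
@code{getline} both @code{$0} and @code{NR} reflect the newly read
record, so only every second line is printed:

@example
$ printf '1\n2\n3\n4\n' | awk '@{ getline; print NR ": " $0 @}'
@print{} 2: 2
@print{} 4: 4
@end example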
|
|
|
|
@node Getline/Variable
|
|
@subsection Using @code{getline} into a Variable
|
|
@c comma before using is NOT for tertiary
|
|
@cindex variables, @code{getline} command into, using
|
|
|
|
You can use @samp{getline @var{var}} to read the next record from
|
|
@command{awk}'s input into the variable @var{var}. No other processing is
|
|
done.
|
|
For example, suppose the next line is a comment or a special string,
|
|
and you want to read it without triggering
|
|
any rules. This form of @code{getline} allows you to read that line
|
|
and store it in a variable so that the main
|
|
read-a-line-and-check-each-rule loop of @command{awk} never sees it.
|
|
The following example swaps every two lines of input:
|
|
|
|
@example
|
|
@{
|
|
if ((getline tmp) > 0) @{
|
|
print tmp
|
|
print $0
|
|
@} else
|
|
print $0
|
|
@}
|
|
@end example
|
|
|
|
@noindent
|
|
It takes the following list:
|
|
|
|
@example
|
|
wan
|
|
tew
|
|
free
|
|
phore
|
|
@end example
|
|
|
|
@noindent
|
|
and produces these results:
|
|
|
|
@example
|
|
tew
|
|
wan
|
|
phore
|
|
free
|
|
@end example
|
|
|
|
The @code{getline} command used in this way sets only the variables
|
|
@code{NR} and @code{FNR} (and of course, @var{var}). The record is not
|
|
split into fields, so the values of the fields (including @code{$0}) and
|
|
the value of @code{NF} do not change.
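
A small sketch with made-up input makes this visible. After the
@code{getline}, @code{NR} has advanced to 2, but @code{NF} and @code{$0}
still describe the first record:

@example
$ printf 'a b\nc d e\n' |
> awk '@{ getline tmp; print NR, NF, $0; print tmp @}'
@print{} 2 2 a b
@print{} c d e
@end example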
|
|
|
|
@node Getline/File
|
|
@subsection Using @code{getline} from a File
|
|
|
|
@cindex input redirection
|
|
@cindex redirection of input
|
|
@cindex @code{<} (left angle bracket), @code{<} operator (I/O)
|
|
@cindex left angle bracket (@code{<}), @code{<} operator (I/O)
|
|
@cindex operators, input/output
|
|
Use @samp{getline < @var{file}} to read the next record from @var{file}.
|
|
Here @var{file} is a string-valued expression that
|
|
specifies the @value{FN}. @samp{< @var{file}} is called a @dfn{redirection}
|
|
because it directs input to come from a different place.
|
|
For example, the following
|
|
program reads its input record from the file @file{secondary.input} when it
|
|
encounters a first field with a value equal to 10 in the current input
|
|
file:
|
|
|
|
@example
|
|
@{
|
|
if ($1 == 10) @{
|
|
getline < "secondary.input"
|
|
print
|
|
@} else
|
|
print
|
|
@}
|
|
@end example
|
|
|
|
Because the main input stream is not used, the values of @code{NR} and
|
|
@code{FNR} are not changed. However, the record it reads is split into fields in
|
|
the normal manner, so the values of @code{$0} and the other fields are
|
|
changed, resulting in a new value of @code{NF}.
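
To see this in a sketch, create a small @file{secondary.input} with
made-up contents and feed the program a single record. @code{NR} stays
at 1, while @code{NF} and @code{$0} come from the secondary file:

@example
$ echo 'x y z' > secondary.input
$ echo 10 |
> awk '$1 == 10 @{ getline < "secondary.input"; print NR, NF, $0 @}'
@print{} 1 3 x y z
@end example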
|
|
|
|
@cindex POSIX @command{awk}, @code{<} operator and
|
|
@c Thanks to Paul Eggert for initial wording here
|
|
According to POSIX, @samp{getline < @var{expression}} is ambiguous if
|
|
@var{expression} contains unparenthesized operators other than
|
|
@samp{$}; for example, @samp{getline < dir "/" file} is ambiguous
|
|
because the concatenation operator is not parenthesized. You should
|
|
write it as @samp{getline < (dir "/" file)} if you want your program
|
|
to be portable to other @command{awk} implementations.
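
The portable form might be used as in the following sketch. The directory
@file{demo} and the file @file{data.txt} are hypothetical names created
only for this illustration:

@example
$ mkdir -p demo; echo hello > demo/data.txt   # hypothetical directory and file
$ awk 'BEGIN @{
>   dir = "demo"; file = "data.txt"
>   while ((getline line < (dir "/" file)) > 0)
>       print line
>   close(dir "/" file)
> @}'
@print{} hello
@end example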
|
|
|
|
@node Getline/Variable/File
|
|
@subsection Using @code{getline} into a Variable from a File
|
|
@c comma before using is NOT for tertiary
|
|
@cindex variables, @code{getline} command into, using
|
|
|
|
Use @samp{getline @var{var} < @var{file}} to read input
|
|
from the file
|
|
@var{file}, and put it in the variable @var{var}. As above, @var{file}
|
|
is a string-valued expression that specifies the file from which to read.
|
|
|
|
In this version of @code{getline}, none of the built-in variables are
|
|
changed and the record is not split into fields. The only variable
|
|
changed is @var{var}.
|
|
For example, the following program copies all the input files to the
|
|
output, except for records that say @w{@samp{@@include @var{filename}}}.
|
|
Such a record is replaced by the contents of the file
|
|
@var{filename}:
|
|
|
|
@example
|
|
@{
|
|
if (NF == 2 && $1 == "@@include") @{
|
|
while ((getline line < $2) > 0)
|
|
print line
|
|
close($2)
|
|
@} else
|
|
print
|
|
@}
|
|
@end example
|
|
|
|
Note here how the name of the extra input file is not built into
|
|
the program; it is taken directly from the data, specifically from the second field on
|
|
the @samp{@@include} line.
|
|
|
|
@cindex @code{close} function
|
|
The @code{close} function is called to ensure that if two identical
|
|
@samp{@@include} lines appear in the input, the entire specified file is
|
|
included twice.
|
|
@xref{Close Files And Pipes}.
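
The role of @code{close} can be seen in isolation with a small sketch that
reads the same file twice. The file name @file{inc.txt} and its contents
are assumptions made for this illustration:

@example
$ echo once > inc.txt     # hypothetical include file
$ awk 'BEGIN @{
>   for (i = 1; i <= 2; i++) @{
>       while ((getline line < "inc.txt") > 0)
>           print i ": " line
>       close("inc.txt")    # without this, the second pass reads nothing
>   @}
> @}'
@print{} 1: once
@print{} 2: once
@end example

@noindent
If the call to @code{close} is removed, the file stays open at end-of-file
and the second pass through the loop prints nothing.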
|
|
|
|
One deficiency of this program is that it does not process nested
|
|
@samp{@@include} statements
|
|
(i.e., @samp{@@include} statements in included files)
|
|
the way a true macro preprocessor would.
|
|
@xref{Igawk Program}, for a program
|
|
that does handle nested @samp{@@include} statements.
|
|
|
|
@node Getline/Pipe
|
|
@subsection Using @code{getline} from a Pipe
|
|
|
|
@cindex @code{|} (vertical bar), @code{|} operator (I/O)
|
|
@cindex vertical bar (@code{|}), @code{|} operator (I/O)
|
|
@cindex input pipeline
|
|
@cindex pipes, input
|
|
@cindex operators, input/output
|
|
The output of a command can also be piped into @code{getline}, using
|
|
@samp{@var{command} | getline}. In
|
|
this case, the string @var{command} is run as a shell command and its output
|
|
is piped into @command{awk} to be used as input. This form of @code{getline}
|
|
reads one record at a time from the pipe.
|
|
For example, the following program copies its input to its output, except for
|
|
lines that begin with @samp{@@execute}, which are replaced by the output
|
|
produced by running the rest of the line as a shell command:
|
|
|
|
@example
|
|
@{
|
|
if ($1 == "@@execute") @{
|
|
tmp = substr($0, 10)
|
|
while ((tmp | getline) > 0)
|
|
print
|
|
close(tmp)
|
|
@} else
|
|
print
|
|
@}
|
|
@end example
|
|
|
|
@noindent
|
|
@cindex @code{close} function
|
|
The @code{close} function is called to ensure that if two identical
|
|
@samp{@@execute} lines appear in the input, the command is run for
|
|
each one.
|
|
@ifnottex
|
|
@xref{Close Files And Pipes}.
|
|
@end ifnottex
|
|
@c Exercise!!
|
|
@c This example is unrealistic, since you could just use system
|
|
Given the input:
|
|
|
|
@example
|
|
foo
|
|
bar
|
|
baz
|
|
@@execute who
|
|
bletch
|
|
@end example
|
|
|
|
@noindent
|
|
the program might produce:
|
|
|
|
@cindex Robbins, Bill
|
|
@cindex Robbins, Miriam
|
|
@cindex Robbins, Arnold
|
|
@example
|
|
foo
|
|
bar
|
|
baz
|
|
arnold ttyv0 Jul 13 14:22
|
|
miriam ttyp0 Jul 13 14:23 (murphy:0)
|
|
bill ttyp1 Jul 13 14:23 (murphy:0)
|
|
bletch
|
|
@end example
|
|
|
|
@noindent
|
|
Notice that this program ran the command @command{who} and printed its output in place of the @samp{@@execute} line.
|
|
(If you try this program yourself, you will of course get different results,
|
|
depending upon who is logged in on your system.)
|
|
|
|
This variation of @code{getline} splits the record into fields, sets the
|
|
value of @code{NF}, and recomputes the value of @code{$0}. The values of
|
|
@code{NR} and @code{FNR} are not changed.
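
As a small sketch of the field splitting, the following command replaces
the current record with the output of a command. The command
@samp{echo a b c} is chosen here only for illustration:

@example
$ echo 'main record' | awk '@{ "echo a b c" | getline; print $0, NF @}'
@print{} a b c 3
@end example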
|
|
|
|
@cindex POSIX @command{awk}, @code{|} I/O operator and
|
|
@c Thanks to Paul Eggert for initial wording here
|
|
According to POSIX, @samp{@var{expression} | getline} is ambiguous if
|
|
@var{expression} contains unparenthesized operators other than
|
|
@samp{$}---for example, @samp{@w{"echo "} "date" | getline} is ambiguous
|
|
because the concatenation operator is not parenthesized. You should
|
|
write it as @samp{(@w{"echo "} "date") | getline} if you want your program
|
|
to be portable to other @command{awk} implementations.
|
|
|
|
@node Getline/Variable/Pipe
|
|
@subsection Using @code{getline} into a Variable from a Pipe
|
|
@c comma before using is NOT for tertiary
|
|
@cindex variables, @code{getline} command into, using
|
|
|
|
When you use @samp{@var{command} | getline @var{var}}, the
|
|
output of @var{command} is sent through a pipe to
|
|
@code{getline} and into the variable @var{var}. For example, the
|
|
following program reads the current date and time into the variable
|
|
@code{current_time}, using the @command{date} utility, and then
|
|
prints it:
|
|
|
|
@example
|
|
BEGIN @{
|
|
"date" | getline current_time
|
|
close("date")
|
|
print "Report printed on " current_time
|
|
@}
|
|
@end example
|
|
|
|
In this version of @code{getline}, none of the built-in variables are
|
|
changed and the record is not split into fields.
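
A brief sketch shows that the current record survives this form of
@code{getline}. The variable name @code{greeting} and the command
@samp{echo hello} are assumptions made for illustration:

@example
$ echo 'a b c' |
> awk '@{ "echo hello" | getline greeting; close("echo hello")
>        print $0, NF, greeting @}'
@print{} a b c 3 hello
@end example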
|
|
|
|
@ifinfo
|
|
@c Thanks to Paul Eggert for initial wording here
|
|
According to POSIX, @samp{@var{expression} | getline @var{var}} is ambiguous if
|
|
@var{expression} contains unparenthesized operators other than
|
|
@samp{$}; for example, @samp{@w{"echo "} "date" | getline @var{var}} is ambiguous
|
|
because the concatenation operator is not parenthesized. You should
|
|
write it as @samp{(@w{"echo "} "date") | getline @var{var}} if you want your
|
|
program to be portable to other @command{awk} implementations.
|
|
@end ifinfo
|
|
|
|
@node Getline/Coprocess
|
|
@subsection Using @code{getline} from a Coprocess
|
|
@cindex coprocesses, @code{getline} from
|
|
@c comma before using is NOT for tertiary
|
|
@cindex @code{getline} command, coprocesses, using from
|
|
@cindex @code{|} (vertical bar), @code{|&} operator (I/O)
|
|
@cindex vertical bar (@code{|}), @code{|&} operator (I/O)
|
|
@cindex operators, input/output
|
|
@cindex differences in @command{awk} and @command{gawk}, input/output operators
|
|
|
|
Input into @code{getline} from a pipe is a one-way operation.
|
|
The command that is started with @samp{@var{command} | getline} only
|
|
sends data @emph{to} your @command{awk} program.
|
|
|
|
On occasion, you might want to send data to another program
|
|
for processing and then read the results back.
|
|
@command{gawk} allows you to start a @dfn{coprocess}, with which two-way
|
|
communications are possible. This is done with the @samp{|&}
|
|
operator.
|
|
Typically, you write data to the coprocess first and then
|
|
read results back, as shown in the following:
|
|
|
|
@example
|
|
print "@var{some query}" |& "db_server"
|
|
"db_server" |& getline
|
|
@end example
|
|
|
|
@noindent
|
|
which sends a query to @command{db_server} and then reads the results.
|
|
|
|
The values of @code{NR} and
|
|
@code{FNR} are not changed,
|
|
because the main input stream is not used.
|
|
However, the record is split into fields in
|
|
the normal manner, thus changing the values of @code{$0}, of the other fields,
|
|
and of @code{NF}.
|
|
|
|
Coprocesses are an advanced feature. They are discussed here only because
|
|
this is the @value{SECTION} on @code{getline}.
|
|
@xref{Two-way I/O},
|
|
where coprocesses are discussed in more detail.
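
For a self-contained illustration that does not need a database server,
the following sketch uses @command{sort} as the coprocess. It requires
@command{gawk}; the choice of @command{sort} and the sample words are
assumptions made for this example:

@example
$ gawk 'BEGIN @{
>   cmd = "sort"              # any filter that reads until end of input
>   print "pear"  |& cmd
>   print "apple" |& cmd
>   close(cmd, "to")          # signal end of input to the coprocess
>   while ((cmd |& getline line) > 0)
>       print "got", line
>   close(cmd)
> @}'
@print{} got apple
@print{} got pear
@end example

@noindent
Closing the @code{"to"} end first matters here, because @command{sort}
produces no output until it has seen all of its input.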
|
|
|
|
@node Getline/Variable/Coprocess
|
|
@subsection Using @code{getline} into a Variable from a Coprocess
|
|
@c comma before using is NOT for tertiary
|
|
@cindex variables, @code{getline} command into, using
|
|
|
|
When you use @samp{@var{command} |& getline @var{var}}, the output from
|
|
the coprocess @var{command} is sent through a two-way pipe to @code{getline}
|
|
and into the variable @var{var}.
|
|
|
|
In this version of @code{getline}, none of the built-in variables are
|
|
changed and the record is not split into fields. The only variable
|
|
changed is @var{var}.
|
|
|
|
@ifinfo
|
|
Coprocesses are an advanced feature. They are discussed here only because
|
|
this is the @value{SECTION} on @code{getline}.
|
|
@xref{Two-way I/O},
|
|
where coprocesses are discussed in more detail.
|
|
@end ifinfo
|
|
|
|
@node Getline Notes
|
|
@subsection Points to Remember About @code{getline}
|
|
Here are some miscellaneous points about @code{getline} that
|
|
you should bear in mind:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
When @code{getline} changes the value of @code{$0} and @code{NF},
|
|
@command{awk} does @emph{not} automatically jump to the start of the
|
|
program and start testing the new record against every pattern.
|
|
However, the new record is tested against any subsequent rules.
|
|
|
|
@cindex differences in @command{awk} and @command{gawk}, implementation limitations
|
|
@cindex implementation issues, @command{gawk}, limits
|
|
@cindex @command{awk}, implementations, limits
|
|
@cindex @command{gawk}, implementation issues, limits
|
|
@item
|
|
Many @command{awk} implementations limit the number of pipelines that an @command{awk}
|
|
program may have open to just one. In @command{gawk}, there is no such limit.
|
|
You can open as many pipelines (and coprocesses) as the underlying operating
|
|
system permits.
|
|
|
|
@cindex side effects, @code{FILENAME} variable
|
|
@c The comma before "setting with" does NOT represent a tertiary
|
|
@cindex @code{FILENAME} variable, @code{getline}, setting with
|
|
@cindex dark corner, @code{FILENAME} variable
|
|
@cindex @code{getline} command, @code{FILENAME} variable and
|
|
@cindex @code{BEGIN} pattern, @code{getline} and
|
|
@item
|
|
An interesting side effect occurs if you use @code{getline} without a
|
|
redirection inside a @code{BEGIN} rule. Because an unredirected @code{getline}
|
|
reads from the command-line @value{DF}s, the first @code{getline} command
|
|
causes @command{awk} to set the value of @code{FILENAME}. Normally,
|
|
@code{FILENAME} does not have a value inside @code{BEGIN} rules, because you
|
|
have not yet started to process the command-line @value{DF}s.
|
|
@value{DARKCORNER}
|
|
(@xref{BEGIN/END},
|
|
also @pxref{Auto-set}.)
|
|
|
|
@item
|
|
Using @code{FILENAME} with @code{getline}
|
|
(@samp{getline < FILENAME})
|
|
is likely to be a source for
|
|
confusion. @command{awk} opens a separate input stream from the
|
|
current input file. However, by not using a variable, @code{$0}
|
|
and @code{NF} are still updated. If you're doing this, it's
|
|
probably by accident, and you should reconsider what it is you're
|
|
trying to accomplish.
|
|
@end itemize
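
The first point in the list can be illustrated with a short sketch, using
a two-line sample input assumed for this example. The record read by
@code{getline} is not retested against the first rule, but it is tested
against the rule that follows:

@example
$ printf 'one\ntwo\n' | awk '
> /one/ @{ getline @}
> /two/ @{ print "second rule saw:", $0 @}'
@print{} second rule saw: two
@end example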
|
|
|
|
@node Getline Summary
|
|
@subsection Summary of @code{getline} Variants
|
|
@cindex @code{getline} command, variants
|
|
|
|
The following table summarizes the eight variants of @code{getline},
|
|
listing which built-in variables are set by each one.
|
|
|
|
@multitable {@var{command} @code{|& getline} @var{var}} {1234567890123456789012345678901234567890}
|
|
@item @code{getline} @tab Sets @code{$0}, @code{NF}, @code{FNR}, and @code{NR}
|
|
|
|
@item @code{getline} @var{var} @tab Sets @var{var}, @code{FNR}, and @code{NR}
|
|
|
|
@item @code{getline <} @var{file} @tab Sets @code{$0} and @code{NF}
|
|
|
|
@item @code{getline @var{var} < @var{file}} @tab Sets @var{var}
|
|
|
|
@item @var{command} @code{| getline} @tab Sets @code{$0} and @code{NF}
|
|
|
|
@item @var{command} @code{| getline} @var{var} @tab Sets @var{var}
|
|
|
|
@item @var{command} @code{|& getline} @tab Sets @code{$0} and @code{NF}.
|
|
This is a @command{gawk} extension
|
|
|
|
@item @var{command} @code{|& getline} @var{var} @tab Sets @var{var}.
|
|
This is a @command{gawk} extension
|
|
@end multitable
|
|
@c ENDOFRANGE getl
|
|
@c ENDOFRANGE inex
|
|
@c ENDOFRANGE infir
|
|
|
|
@node Printing
|
|
@chapter Printing Output
|
|
|
|
@c STARTOFRANGE prnt
|
|
@cindex printing
|
|
@cindex output, printing, See printing
|
|
One of the most common programming actions is to @dfn{print}, or output,
|
|
some or all of the input. Use the @code{print} statement
|
|
for simple output, and the @code{printf} statement
|
|
for fancier formatting.
|
|
The @code{print} statement is not limited when
|
|
computing @emph{which} values to print. However, with two exceptions,
|
|
you cannot specify @emph{how} to print them---how many
|
|
columns, whether to use exponential notation or not, and so on.
|
|
(For the exceptions, @pxref{Output Separators}, and
|
|
@ref{OFMT}.)
|
|
For printing with specifications, you need the @code{printf} statement
|
|
(@pxref{Printf}).
|
|
|
|
@c STARTOFRANGE prnts
|
|
@cindex @code{print} statement
|
|
@cindex @code{printf} statement
|
|
Besides basic and formatted printing, this @value{CHAPTER}
|
|
also covers I/O redirections to files and pipes, introduces
|
|
the special @value{FN}s that @command{gawk} processes internally,
|
|
and discusses the @code{close} built-in function.
|
|
|
|
@menu
|
|
* Print:: The @code{print} statement.
|
|
* Print Examples:: Simple examples of @code{print} statements.
|
|
* Output Separators:: The output separators and how to change them.
|
|
* OFMT:: Controlling Numeric Output With @code{print}.
|
|
* Printf:: The @code{printf} statement.
|
|
* Redirection:: How to redirect output to multiple files and
|
|
pipes.
|
|
* Special Files:: File name interpretation in @command{gawk}.
|
|
@command{gawk} allows access to inherited file
|
|
descriptors.
|
|
* Close Files And Pipes:: Closing Input and Output Files and Pipes.
|
|
@end menu
|
|
|
|
@node Print
|
|
@section The @code{print} Statement
|
|
|
|
The @code{print} statement is used to produce output with simple, standardized
|
|
formatting. Specify only the strings or numbers to print, in a
|
|
list separated by commas. They are output, separated by single spaces,
|
|
followed by a newline. The statement looks like this:
|
|
|
|
@example
|
|
print @var{item1}, @var{item2}, @dots{}
|
|
@end example
|
|
|
|
@noindent
|
|
The entire list of items may be optionally enclosed in parentheses. The
|
|
parentheses are necessary if any of the item expressions uses the @samp{>}
|
|
relational operator; otherwise it could be confused with a redirection
|
|
(@pxref{Redirection}).
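
As a minimal sketch of the parenthesized form, the following command
prints the result of a comparison:

@example
$ awk 'BEGIN @{ print (2 > 1) @}'
@print{} 1
@end example

@noindent
Without the parentheses, @samp{print 2 > 1} would instead be parsed as a
redirection that writes @samp{2} to a file named @file{1}.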
|
|
|
|
The items to print can be constant strings or numbers, fields of the
|
|
current record (such as @code{$1}), variables, or any @command{awk}
|
|
expression. Numeric values are converted to strings and then printed.
|
|
|
|
@cindex records, printing
|
|
@cindex lines, blank, printing
|
|
@cindex text, printing
|
|
The simple statement @samp{print} with no items is equivalent to
|
|
@samp{print $0}: it prints the entire current record. To print a blank
|
|
line, use @samp{print ""}, where @code{""} is the empty string.
|
|
To print a fixed piece of text, use a string constant, such as
|
|
@w{@code{"Don't Panic"}}, as one item. If you forget to use the
|
|
double-quote characters, your text is taken as an @command{awk}
|
|
expression, and you will probably get an error. Keep in mind that a
|
|
space is printed between any two items.
|
|
|
|
@node Print Examples
|
|
@section Examples of @code{print} Statements
|
|
|
|
Each @code{print} statement makes at least one line of output. However, it
|
|
isn't limited to only one line. If an item value is a string that contains a
|
|
newline, the newline is output along with the rest of the string. A
|
|
single @code{print} statement can make any number of lines this way.
|
|
|
|
@cindex newlines, printing
|
|
The following is an example of printing a string that contains embedded newlines
|
|
(the @samp{\n} is an escape sequence, used to represent the newline
|
|
character; @pxref{Escape Sequences}):
|
|
|
|
@example
|
|
$ awk 'BEGIN @{ print "line one\nline two\nline three" @}'
|
|
@print{} line one
|
|
@print{} line two
|
|
@print{} line three
|
|
@end example
|
|
|
|
@cindex fields, printing
|
|
The next example, which is run on the @file{inventory-shipped} file,
|
|
prints the first two fields of each input record, with a space between
|
|
them:
|
|
|
|
@example
|
|
$ awk '@{ print $1, $2 @}' inventory-shipped
|
|
@print{} Jan 13
|
|
@print{} Feb 15
|
|
@print{} Mar 15
|
|
@dots{}
|
|
@end example
|
|
|
|
@cindex @code{print} statement, commas, omitting
|
|
@c comma does NOT start tertiary
|
|
@cindex troubleshooting, @code{print} statement, omitting commas
|
|
A common mistake in using the @code{print} statement is to omit the comma
|
|
between two items. This often has the effect of making the items run
|
|
together in the output, with no space. The reason for this is that
|
|
juxtaposing two string expressions in @command{awk} means to concatenate
|
|
them. Here is the same program, without the comma:
|
|
|
|
@example
|
|
$ awk '@{ print $1 $2 @}' inventory-shipped
|
|
@print{} Jan13
|
|
@print{} Feb15
|
|
@print{} Mar15
|
|
@dots{}
|
|
@end example
|
|
|
|
@c comma does NOT start tertiary
|
|
@cindex @code{BEGIN} pattern, headings, adding
|
|
To someone unfamiliar with the @file{inventory-shipped} file, neither
|
|
example's output makes much sense. A heading line at the beginning
|
|
would make it clearer. Let's add some headings to our table of months
|
|
(@code{$1}) and green crates shipped (@code{$2}). We do this using the
|
|
@code{BEGIN} pattern
|
|
(@pxref{BEGIN/END})
|
|
so that the headings are only printed once:
|
|
|
|
@example
|
|
awk 'BEGIN @{ print "Month Crates"
|
|
print "----- ------" @}
|
|
@{ print $1, $2 @}' inventory-shipped
|
|
@end example
|
|
|
|
@noindent
|
|
When run, the program prints the following:
|
|
|
|
@example
|
|
Month Crates
|
|
----- ------
|
|
Jan 13
|
|
Feb 15
|
|
Mar 15
|
|
@dots{}
|
|
@end example
|
|
|
|
@noindent
|
|
The only problem, however, is that the headings and the table data
|
|
don't line up! We can fix this by printing some spaces between the
|
|
two fields:
|
|
|
|
@example
|
|
@group
|
|
awk 'BEGIN @{ print "Month Crates"
|
|
print "----- ------" @}
|
|
@{ print $1, " ", $2 @}' inventory-shipped
|
|
@end group
|
|
@end example
|
|
|
|
@c comma does NOT start tertiary
|
|
@cindex @code{printf} statement, columns, aligning
|
|
@cindex columns, aligning
|
|
Lining up columns this way can get pretty
|
|
complicated when there are many columns to fix. Counting spaces for two
|
|
or three columns is simple, but any more than this can take up
|
|
a lot of time. This is why the @code{printf} statement was
|
|
created (@pxref{Printf});
|
|
one of its specialties is lining up columns of data.
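
As a rough preview, the table might be aligned as in the following sketch.
The field widths of six characters and the two sample records are
assumptions chosen for this illustration:

@example
$ printf 'Jan 13\nFeb 15\n' |
> awk 'BEGIN @{ printf "%-6s %6s\n", "Month", "Crates" @}
>            @{ printf "%-6s %6d\n", $1, $2 @}'
@print{} Month  Crates
@print{} Jan        13
@print{} Feb        15
@end example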
|
|
|
|
@cindex line continuations, in @code{print} statement
|
|
@cindex @code{print} statement, line continuations and
|
|
@strong{Note:} You can continue either a @code{print} or
|
|
@code{printf} statement simply by putting a newline after any comma
|
|
(@pxref{Statements/Lines}).
|
|
@c ENDOFRANGE prnts
|
|
|
|
@node Output Separators
|
|
@section Output Separators
|
|
|
|
@cindex @code{OFS} variable
|
|
As mentioned previously, a @code{print} statement contains a list
|
|
of items separated by commas. In the output, the items are normally
|
|
separated by single spaces. However, this doesn't need to be the case;
|
|
a single space is only the default. Any string of
|
|
characters may be used as the @dfn{output field separator} by setting the
|
|
built-in variable @code{OFS}. The initial value of this variable
|
|
is the string @w{@code{" "}}---that is, a single space.
|
|
|
|
The output from an entire @code{print} statement is called an
|
|
@dfn{output record}. Each @code{print} statement outputs one output
|
|
record, and then outputs a string called the @dfn{output record separator}
|
|
(or @code{ORS}). The initial
|
|
value of @code{ORS} is the string @code{"\n"}; i.e., a newline
|
|
character. Thus, each @code{print} statement normally makes a separate line.
|
|
|
|
@cindex output, records
|
|
@cindex output record separator, See @code{ORS} variable
|
|
@cindex @code{ORS} variable
|
|
@cindex @code{BEGIN} pattern, @code{OFS}/@code{ORS} variables, assigning values to
|
|
In order to change how output fields and records are separated, assign
|
|
new values to the variables @code{OFS} and @code{ORS}. The usual
|
|
place to do this is in the @code{BEGIN} rule
|
|
(@pxref{BEGIN/END}), so
|
|
that it happens before any input is processed. It can also be done
|
|
with assignments on the command line, before the names of the input
|
|
files, or using the @option{-v} command-line option
|
|
(@pxref{Options}).
|
|
The following example prints the first and second fields of each input
|
|
record, separated by a semicolon, with a blank line added after each
|
|
newline:
|
|
|
|
@ignore
|
|
Exercise,
|
|
Rewrite the
|
|
@example
|
|
awk 'BEGIN @{ print "Month Crates"
|
|
print "----- ------" @}
|
|
@{ print $1, " ", $2 @}' inventory-shipped
|
|
@end example
|
|
program by using a new value of @code{OFS}.
|
|
@end ignore
|
|
|
|
@example
|
|
$ awk 'BEGIN @{ OFS = ";"; ORS = "\n\n" @}
|
|
> @{ print $1, $2 @}' BBS-list
|
|
@print{} aardvark;555-5553
|
|
@print{}
|
|
@print{} alpo-net;555-3412
|
|
@print{}
|
|
@print{} barfly;555-7685
|
|
@dots{}
|
|
@end example
|
|
|
|
If the value of @code{ORS} does not contain a newline, the program's output
|
|
is run together on a single line.
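
The following sketch shows both separators being set with the @option{-v}
option instead of a @code{BEGIN} rule. The sample input is assumed for
illustration:

@example
$ echo 'a b c' | awk -v OFS=- '@{ print $1, $2, $3 @}'
@print{} a-b-c
$ printf 'a\nb\nc\n' | awk -v ORS=' ' '@{ print @} END @{ printf "\n" @}'
@print{} a b c
@end example

@noindent
In the second command, every record is followed by a space rather than a
newline, so the output line ends with a trailing space. The @code{END}
rule supplies the final newline.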
|
|
|
|
@node OFMT
|
|
@section Controlling Numeric Output with @code{print}
|
|
@cindex numeric, output format
|
|
@c the comma does NOT start a secondary
|
|
@cindex formats, numeric output
|
|
When the @code{print} statement is used to print numeric values,
|
|
@command{awk} internally converts the number to a string of characters
|
|
and prints that string. @command{awk} uses the @code{sprintf} function
|
|
to do this conversion
|
|
(@pxref{String Functions}).
|
|
For now, it suffices to say that the @code{sprintf}
|
|
function accepts a @dfn{format specification} that tells it how to format
|
|
numbers (or strings), and that there are a number of different ways in which
|
|
numbers can be formatted. The different format specifications are discussed
|
|
more fully in
|
|
@ref{Control Letters}.
|
|
|
|
@cindex @code{sprintf} function
|
|
@cindex @code{OFMT} variable
|
|
@c the comma before OFMT does NOT start a tertiary
|
|
@cindex output, format specifier, @code{OFMT}
|
|
The built-in variable @code{OFMT} contains the default format specification
|
|
that @code{print} uses with @code{sprintf} when it wants to convert a
|
|
number to a string for printing.
|
|
The default value of @code{OFMT} is @code{"%.6g"}.
|
|
The way @code{print} prints numbers can be changed
|
|
by supplying different format specifications
|
|
as the value of @code{OFMT}, as shown in the following example:
|
|
|
|
@example
|
|
$ awk 'BEGIN @{
|
|
> OFMT = "%.0f" # print numbers as integers (rounds)
|
|
> print 17.23, 17.54 @}'
|
|
@print{} 17 18
|
|
@end example
|
|
|
|
@noindent
|
|
@cindex dark corner, @code{OFMT} variable
|
|
@cindex POSIX @command{awk}, @code{OFMT} variable and
|
|
@cindex @code{OFMT} variable, POSIX @command{awk} and
|
|
According to the POSIX standard, @command{awk}'s behavior is undefined
|
|
if @code{OFMT} contains anything but a floating-point conversion specification.
|
|
@value{DARKCORNER}
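
A further sketch, using values chosen only for illustration, shows that
@code{OFMT} applies to numbers with a fractional part. Values that are
integers are printed as integers:

@example
$ awk 'BEGIN @{ OFMT = "%.2f"; print 3.14159, 42 @}'
@print{} 3.14 42
@end example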
|
|
|
|
@node Printf
|
|
@section Using @code{printf} Statements for Fancier Printing
|
|
|
|
@c STARTOFRANGE printfs
|
|
@cindex @code{printf} statement
|
|
@cindex output, formatted
|
|
@cindex formatting output
|
|
For more precise control over the output format than what is
|
|
normally provided by @code{print}, use @code{printf}.
|
|
@code{printf} can be used to
|
|
specify the width to use for each item, as well as various
|
|
formatting choices for numbers (such as what output base to use, whether to
|
|
print an exponent, whether to print a sign, and how many digits to print
|
|
after the decimal point). This is done by supplying a string, called
|
|
the @dfn{format string}, that controls how and where to print the other
|
|
arguments.
|
|
|
|
@menu
|
|
* Basic Printf:: Syntax of the @code{printf} statement.
|
|
* Control Letters:: Format-control letters.
|
|
* Format Modifiers:: Format-specification modifiers.
|
|
* Printf Examples:: Several examples.
|
|
@end menu
|
|
|
|
@node Basic Printf
|
|
@subsection Introduction to the @code{printf} Statement
|
|
|
|
@cindex @code{printf} statement, syntax of
|
|
A simple @code{printf} statement looks like this:
|
|
|
|
@example
|
|
printf @var{format}, @var{item1}, @var{item2}, @dots{}
|
|
@end example
|
|
|
|
@noindent
|
|
The entire list of arguments may optionally be enclosed in parentheses. The
|
|
parentheses are necessary if any of the item expressions use the @samp{>}
|
|
relational operator; otherwise, it can be confused with a redirection
|
|
(@pxref{Redirection}).
|
|
|
|
@cindex format strings
|
|
The difference between @code{printf} and @code{print} is the @var{format}
|
|
argument. This is an expression whose value is taken as a string; it
|
|
specifies how to output each of the other arguments. It is called the
|
|
@dfn{format string}.
|
|
|
|
The format string is very similar to that in the ISO C library function
|
|
@code{printf}. Most of @var{format} is text to output verbatim.
|
|
Scattered among this text are @dfn{format specifiers}---one per item.
|
|
Each format specifier says to output the next item in the argument list
|
|
at that place in the format.
|
|
|
|
The @code{printf} statement does not automatically append a newline
|
|
to its output. It outputs only what the format string specifies.
|
|
So if a newline is needed, you must include one in the format string.
|
|
The output separator variables @code{OFS} and @code{ORS} have no effect
|
|
on @code{printf} statements. For example:
|
|
|
|
@example
|
|
$ awk 'BEGIN @{
|
|
> ORS = "\nOUCH!\n"; OFS = "+"
|
|
> msg = "Dont Panic!"
|
|
> printf "%s\n", msg
|
|
> @}'
|
|
@print{} Dont Panic!
|
|
@end example
|
|
|
|
@noindent
|
|
Here, neither the @samp{+} nor the @samp{OUCH} appears when
|
|
the message is printed.
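
Because no newline is added automatically, several @code{printf}
statements can build up a single output line. The following is a minimal
sketch with made-up values:

@example
$ awk 'BEGIN @{ printf "%s has ", "cart"; printf "%d items\n", 3 @}'
@print{} cart has 3 items
@end example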
|
|
|
|
@node Control Letters
|
|
@subsection Format-Control Letters
|
|
@cindex @code{printf} statement, format-control characters
|
|
@cindex format specifiers, @code{printf} statement
|
|
|
|
A format specifier starts with the character @samp{%} and ends with
|
|
a @dfn{format-control letter}---it tells the @code{printf} statement
|
|
how to output one item. The format-control letter specifies what @emph{kind}
|
|
of value to print. The rest of the format specifier is made up of
|
|
optional @dfn{modifiers} that control @emph{how} to print the value, such as
|
|
the field width. Here is a list of the format-control letters:
|
|
|
|
@table @code
|
|
@item %c
|
|
This prints a number as an ASCII character; thus, @samp{printf "%c",
|
|
65} outputs the letter @samp{A}. (The output for a string value is
|
|
the first character of the string.)
|
|
|
|
@item %d@r{,} %i
|
|
These are equivalent; they both print a decimal integer.
|
|
(The @samp{%i} specification is for compatibility with ISO C.)
|
|
|
|
@item %e@r{,} %E
|
|
These print a number in scientific (exponential) notation;
|
|
for example:
|
|
|
|
@example
|
|
printf "%4.3e\n", 1950
|
|
@end example
|
|
|
|
@noindent
|
|
prints @samp{1.950e+03}, with a total of four significant figures, three of
|
|
which follow the decimal point.
|
|
(The @samp{4.3} represents two modifiers,
|
|
discussed in the next @value{SUBSECTION}.)
|
|
@samp{%E} uses @samp{E} instead of @samp{e} in the output.
|
|
|
|
@item %f
|
|
This prints a number in floating-point notation.
|
|
For example:
|
|
|
|
@example
|
|
printf "%4.3f", 1950
|
|
@end example
|
|
|
|
@noindent
|
|
prints @samp{1950.000}, with three digits following the decimal point.
|
|
(The @samp{4.3} represents two modifiers,
|
|
discussed in the next @value{SUBSECTION}.)
|
|
|
|
@item %g@r{,} %G
|
|
These print a number in either scientific notation or in floating-point
|
|
notation, whichever uses fewer characters; if the result is printed in
|
|
scientific notation, @samp{%G} uses @samp{E} instead of @samp{e}.
|
|
|
|
@item %o
|
|
This prints an unsigned octal integer.
|
|
|
|
@item %s
|
|
This prints a string.
|
|
|
|
@item %u
|
|
This prints an unsigned decimal integer.
|
|
(This format is of marginal use, because all numbers in @command{awk}
|
|
are floating-point; it is provided primarily for compatibility with C.)
|
|
|
|
@item %x@r{,} %X
|
|
These print an unsigned hexadecimal integer;
|
|
@samp{%X} uses the letters @samp{A} through @samp{F}
|
|
instead of @samp{a} through @samp{f}.
|
|
|
|
@item %%
|
|
This isn't a format-control letter, but it does have meaning---the
|
|
sequence @samp{%%} outputs one @samp{%}; it does not consume an
|
|
argument and it ignores any modifiers.
|
|
@end table
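
Several of these letters can be tried together in one statement. The
following sketch uses arbitrary values chosen for illustration:

@example
$ awk 'BEGIN @{
>   printf "%c %d %o %x %X %e %%\n", 65, 255, 8, 255, 255, 1950
> @}'
@print{} A 255 10 ff FF 1.950000e+03 %
@end example

@noindent
Note that @samp{%%} produces a percent sign without using up one of the
arguments.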
|
|
|
|
@cindex dark corner, format-control characters
|
|
@cindex @command{gawk}, format-control characters
|
|
@strong{Note:}
|
|
When using the integer format-control letters for values that are
|
|
outside the range of the widest C integer type, @command{gawk} switches to the
|
|
@samp{%g} format specifier. If @option{--lint} is provided on the
|
|
command line (@pxref{Options}), @command{gawk}
|
|
warns about this. Other versions of @command{awk} may print invalid
|
|
values or do something else entirely.
|
|
@value{DARKCORNER}
|
|
|
|
@node Format Modifiers
|
|
@subsection Modifiers for @code{printf} Formats
|
|
|
|
@c STARTOFRANGE pfm
|
|
@cindex @code{printf} statement, modifiers
|
|
@c the comma here does NOT start a secondary
|
|
@cindex modifiers, in format specifiers
|
|
A format specification can also include @dfn{modifiers} that can control
|
|
how much of the item's value is printed, as well as how much space it gets.
|
|
The modifiers come between the @samp{%} and the format-control letter.
|
|
We will use the bullet symbol ``@bullet{}'' in the following examples to
|
|
represent
|
|
spaces in the output. Here are the possible modifiers, in the order in
|
|
which they may appear:
|
|
|
|
@table @code
|
|
@cindex differences in @command{awk} and @command{gawk}, @code{print}/@code{printf} statements
|
|
@cindex @code{printf} statement, positional specifiers
|
|
@c the comma does NOT start a secondary
|
|
@cindex positional specifiers, @code{printf} statement
|
|
@item @var{N}$
|
|
An integer constant followed by a @samp{$} is a @dfn{positional specifier}.
|
|
Normally, format specifications are applied to arguments in the order
|
|
given in the format string. With a positional specifier, the format
|
|
specification is applied to a specific argument, instead of what
|
|
would be the next argument in the list. Positional specifiers begin
|
|
counting with one. Thus:
|
|
|
|
@example
|
|
printf "%s %s\n", "don't", "panic"
|
|
printf "%2$s %1$s\n", "panic", "don't"
|
|
@end example
|
|
|
|
@noindent
|
|
prints the famous friendly message twice.
|
|
|
|
At first glance, this feature doesn't seem to be of much use.
|
|
It is in fact a @command{gawk} extension, intended for use in translating
|
|
messages at runtime.
|
|
@xref{Printf Ordering},
|
|
which describes how and why to use positional specifiers.
|
|
For now, we will not use them.
|
|
|
|
@item -
|
|
The minus sign, used before the width modifier (see later on in
|
|
this table),
|
|
says to left-justify
|
|
the argument within its specified width. Normally, the argument
|
|
is printed right-justified in the specified width. Thus:
|
|
|
|
@example
|
|
printf "%-4s", "foo"
|
|
@end example
|
|
|
|
@noindent
|
|
prints @samp{foo@bullet{}}.
|
|
|
|
@item @var{space}
|
|
For numeric conversions, prefix positive values with a space and
|
|
negative values with a minus sign.
|
|
|
|
@item +
|
|
The plus sign, used before the width modifier (see later on in
this table), says to always supply a sign for numeric conversions,
even if the data to format is positive. The @samp{+} overrides the
space modifier.
|
|
|
|
@item #
|
|
Use an ``alternate form'' for certain control letters.
|
|
For @samp{%o}, supply a leading zero.
|
|
For @samp{%x} and @samp{%X}, supply a leading @samp{0x} or @samp{0X} for
|
|
a nonzero result.
|
|
For @samp{%e}, @samp{%E}, and @samp{%f}, the result always contains a
|
|
decimal point.
|
|
For @samp{%g} and @samp{%G}, trailing zeros are not removed from the result.
|
|
|
|
@cindex dark corner
|
|
@item 0
|
|
A leading @samp{0} (zero) acts as a flag that indicates that output should be
|
|
padded with zeros instead of spaces.
|
|
This applies even to non-numeric output formats.
|
|
@value{DARKCORNER}
|
|
This flag only has an effect when the field width is wider than the
|
|
value to print.
|
|
|
|
@item @var{width}
|
|
This is a number specifying the desired minimum width of a field. Inserting any
|
|
number between the @samp{%} sign and the format-control character forces the
|
|
field to expand to this width. The default way to do this is to
|
|
pad with spaces on the left. For example:
|
|
|
|
@example
|
|
printf "%4s", "foo"
|
|
@end example
|
|
|
|
@noindent
|
|
prints @samp{@bullet{}foo}.
|
|
|
|
The value of @var{width} is a minimum width, not a maximum. If the item
|
|
value requires more than @var{width} characters, it can be as wide as
|
|
necessary. Thus, the following:
|
|
|
|
@example
|
|
printf "%4s", "foobar"
|
|
@end example
|
|
|
|
@noindent
|
|
prints @samp{foobar}.
|
|
|
|
Preceding the @var{width} with a minus sign causes the output to be
|
|
padded with spaces on the right, instead of on the left.
|
|
|
|
@item .@var{prec}
|
|
A period followed by an integer constant
|
|
specifies the precision to use when printing.
|
|
The meaning of the precision varies by control letter:
|
|
|
|
@table @asis
|
|
@item @code{%e}, @code{%E}, @code{%f}
|
|
Number of digits to the right of the decimal point.
|
|
|
|
@item @code{%g}, @code{%G}
|
|
Maximum number of significant digits.
|
|
|
|
@item @code{%d}, @code{%i}, @code{%o}, @code{%u}, @code{%x}, @code{%X}
|
|
Minimum number of digits to print.
|
|
|
|
@item @code{%s}
|
|
Maximum number of characters from the string that should print.
|
|
@end table
|
|
|
|
Thus, the following:
|
|
|
|
@example
|
|
printf "%.4s", "foobar"
|
|
@end example
|
|
|
|
@noindent
|
|
prints @samp{foob}.
|
|
@end table
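As a quick illustration of the flag and precision rules in the table above,
here is a minimal sketch that wraps each conversion in square brackets so
that any padding is visible. It assumes a POSIX shell with some
@command{awk} on the search path; the shell variable @code{flags} is only
an illustrative name and is not part of @command{awk}.

```shell
# Each conversion is wrapped in [ ] so that padding is visible.
flags=$(awk 'BEGIN {
    printf "[%+d] [% d] [%05d]\n", 42, 42, 42    # sign, space, zero padding
    printf "[%#o] [%#x]\n", 8, 255               # alternate forms
    printf "[%.2f] [%.3d] [%.3g]\n", 3.14159, 7, 3.14159
}')
printf '%s\n' "$flags"
```

With a conforming @code{printf}, the three output lines should be
@samp{[+42] [ 42] [00042]}, @samp{[010] [0xff]}, and
@samp{[3.14] [007] [3.14]}.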
|
|
|
|
The C library @code{printf}'s dynamic @var{width} and @var{prec}
|
|
capability (for example, @code{"%*.*s"}) is supported. Instead of
|
|
supplying explicit @var{width} and/or @var{prec} values in the format
|
|
string, they are passed in the argument list. For example:
|
|
|
|
@example
|
|
w = 5
|
|
p = 3
|
|
s = "abcdefg"
|
|
printf "%*.*s\n", w, p, s
|
|
@end example
|
|
|
|
@noindent
|
|
is exactly equivalent to:
|
|
|
|
@example
|
|
s = "abcdefg"
|
|
printf "%5.3s\n", s
|
|
@end example
|
|
|
|
@noindent
|
|
Both programs output @samp{@w{@bullet{}@bullet{}abc}}.
|
|
Earlier versions of @command{awk} did not support this capability.
|
|
If you must use such a version, you may simulate this feature by using
|
|
concatenation to build up the format string, like so:
|
|
|
|
@example
|
|
w = 5
|
|
p = 3
|
|
s = "abcdefg"
|
|
printf "%" w "." p "s\n", s
|
|
@end example
|
|
|
|
@noindent
|
|
This is not particularly easy to read, but it does work.
|
|
|
|
@c @cindex lint checks
|
|
@cindex troubleshooting, fatal errors, @code{printf} format strings
|
|
@cindex POSIX @command{awk}, @code{printf} format strings and
|
|
C programmers may be used to supplying additional
|
|
@samp{l}, @samp{L}, and @samp{h}
|
|
modifiers in @code{printf} format strings. These are not valid in @command{awk}.
|
|
Most @command{awk} implementations silently ignore these modifiers.
|
|
If @option{--lint} is provided on the command line
|
|
(@pxref{Options}),
|
|
@command{gawk} warns about their use. If @option{--posix} is supplied,
|
|
their use is a fatal error.
|
|
@c ENDOFRANGE pfm
|
|
|
|
@node Printf Examples
|
|
@subsection Examples Using @code{printf}
|
|
|
|
The following is a simple example of
|
|
how to use @code{printf} to make an aligned table:
|
|
|
|
@example
|
|
awk '@{ printf "%-10s %s\n", $1, $2 @}' BBS-list
|
|
@end example
|
|
|
|
@noindent
|
|
This command prints the name of each bulletin board (@code{$1}) in the
file @file{BBS-list} left-justified in a field 10 characters wide.
It then prints the phone number (@code{$2}) next on the line. This
produces an aligned two-column table of names and phone numbers,
as shown here:
|
|
|
|
@example
$ awk '@{ printf "%-10s %s\n", $1, $2 @}' BBS-list
@print{} aardvark   555-5553
@print{} alpo-net   555-3412
@print{} barfly     555-7685
@print{} bites      555-1675
@print{} camelot    555-0542
@print{} core       555-2912
@print{} fooey      555-1234
@print{} foot       555-6699
@print{} macfoo     555-6480
@print{} sdace      555-3430
@print{} sabafoo    555-2127
@end example
|
|
|
|
In this case, the phone numbers had to be printed as strings because
each number contains a dash. Printing the phone numbers as
numbers would have produced just the first three digits: @samp{555}.
|
|
This would have been pretty confusing.
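The effect is easy to check on a single record. The sketch below is
only an illustration: it feeds one made-up line in the style of
@file{BBS-list} to @command{awk} and formats the second field first with
@samp{%d} and then with @samp{%s}. The shell variable names are
hypothetical.

```shell
# Format the same field as a number and as a string.
as_number=$(echo "aardvark 555-5553" | awk '{ printf "%d\n", $2 }')
as_string=$(echo "aardvark 555-5553" | awk '{ printf "%s\n", $2 }')
printf '%s\n%s\n' "$as_number" "$as_string"
```

The numeric conversion stops at the dash, so @code{as_number} holds
@samp{555}, while @code{as_string} holds the whole field.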
|
|
|
|
It wasn't necessary to specify a width for the phone numbers because
|
|
they are last on their lines. They don't need to have spaces
|
|
after them.
|
|
|
|
The table could be made to look even nicer by adding headings to the
|
|
tops of the columns. This is done using the @code{BEGIN} pattern
|
|
(@pxref{BEGIN/END})
|
|
so that the headers are only printed once, at the beginning of
|
|
the @command{awk} program:
|
|
|
|
@example
awk 'BEGIN @{ print "Name       Number"
             print "----       ------" @}
           @{ printf "%-10s %s\n", $1, $2 @}' BBS-list
@end example
|
|
|
|
The above example mixed @code{print} and @code{printf} statements in
|
|
the same program. Using just @code{printf} statements can produce the
|
|
same results:
|
|
|
|
@example
awk 'BEGIN @{ printf "%-10s %s\n", "Name", "Number"
             printf "%-10s %s\n", "----", "------" @}
           @{ printf "%-10s %s\n", $1, $2 @}' BBS-list
@end example
|
|
|
|
@noindent
|
|
Printing each column heading with the same format specification
|
|
used for the column elements ensures that the headings
|
|
are aligned just like the columns.
|
|
|
|
The fact that the same format specification is used three times can be
|
|
emphasized by storing it in a variable, like this:
|
|
|
|
@example
awk 'BEGIN @{ format = "%-10s %s\n"
             printf format, "Name", "Number"
             printf format, "----", "------" @}
           @{ printf format, $1, $2 @}' BBS-list
@end example
|
|
|
|
@c !!! exercise
|
|
At this point, it would be a worthwhile exercise to use the
|
|
@code{printf} statement to line up the headings and table data for the
|
|
@file{inventory-shipped} example that was covered earlier in the @value{SECTION}
|
|
on the @code{print} statement
|
|
(@pxref{Print}).
|
|
@c ENDOFRANGE printfs
|
|
|
|
@node Redirection
|
|
@section Redirecting Output of @code{print} and @code{printf}
|
|
|
|
@cindex output redirection
|
|
@cindex redirection of output
|
|
So far, the output from @code{print} and @code{printf} has gone
|
|
to the standard
|
|
output, usually the terminal. Both @code{print} and @code{printf} can
|
|
also send their output to other places.
|
|
This is called @dfn{redirection}.
|
|
|
|
A redirection appears after the @code{print} or @code{printf} statement.
|
|
Redirections in @command{awk} are written just like redirections in shell
|
|
commands, except that they are written inside the @command{awk} program.
|
|
|
|
@c the commas here are part of the see also
|
|
@cindex @code{print} statement, See Also redirection, of output
|
|
@cindex @code{printf} statement, See Also redirection, of output
|
|
There are four forms of output redirection: output to a file, output
|
|
appended to a file, output through a pipe to another command, and output
|
|
to a coprocess. They are all shown for the @code{print} statement,
|
|
but they work identically for @code{printf}:
|
|
|
|
@table @code
|
|
@cindex @code{>} (right angle bracket), @code{>} operator (I/O)
|
|
@cindex right angle bracket (@code{>}), @code{>} operator (I/O)
|
|
@cindex operators, input/output
|
|
@item print @var{items} > @var{output-file}
|
|
This type of redirection prints the items into the output file named
|
|
@var{output-file}. The @value{FN} @var{output-file} can be any
|
|
expression. Its value is changed to a string and then used as a
|
|
@value{FN} (@pxref{Expressions}).
|
|
|
|
When this type of redirection is used, the @var{output-file} is erased
|
|
before the first output is written to it. Subsequent writes to the same
|
|
@var{output-file} do not erase @var{output-file}, but append to it.
|
|
(This is different from how you use redirections in shell scripts.)
|
|
If @var{output-file} does not exist, it is created. For example, here
|
|
is how an @command{awk} program can write a list of BBS names to one
|
|
file named @file{name-list}, and a list of phone numbers to another file
|
|
named @file{phone-list}:
|
|
|
|
@example
$ awk '@{ print $2 > "phone-list"
>        print $1 > "name-list" @}' BBS-list
$ cat phone-list
@print{} 555-5553
@print{} 555-3412
@dots{}
$ cat name-list
@print{} aardvark
@print{} alpo-net
@dots{}
@end example
|
|
|
|
@noindent
|
|
Each output file contains one name or number per line.
|
|
|
|
@cindex @code{>} (right angle bracket), @code{>>} operator (I/O)
|
|
@cindex right angle bracket (@code{>}), @code{>>} operator (I/O)
|
|
@item print @var{items} >> @var{output-file}
|
|
This type of redirection prints the items into the output file
named @var{output-file}, which may or may not already exist.
The difference between this and the
single-@samp{>} redirection is that the old contents (if any) of
@var{output-file} are not erased. Instead, the @command{awk} output is
appended to the file.
If @var{output-file} does not exist, then it is created.
|
|
|
|
@cindex @code{|} (vertical bar), @code{|} operator (I/O)
|
|
@cindex pipes, output
|
|
@cindex output, pipes
|
|
@item print @var{items} | @var{command}
|
|
It is also possible to send output to another program through a pipe
|
|
instead of into a file. This type of redirection opens a pipe to
|
|
@var{command}, and writes the values of @var{items} through this pipe
|
|
to another process created to execute @var{command}.
|
|
|
|
The redirection argument @var{command} is actually an @command{awk}
|
|
expression. Its value is converted to a string whose contents give
|
|
the shell command to be run. For example, the following produces two
|
|
files, one unsorted list of BBS names, and one list sorted in reverse
|
|
alphabetical order:
|
|
|
|
@ignore
|
|
10/2000:
|
|
This isn't the best style, since COMMAND is assigned for each
|
|
record. It's done to avoid overfull hboxes in TeX. Leave it
|
|
alone for now and let's hope no-one notices.
|
|
@end ignore
|
|
|
|
@example
awk '@{ print $1 > "names.unsorted"
       command = "sort -r > names.sorted"
       print $1 | command @}' BBS-list
@end example
|
|
|
|
The unsorted list is written with an ordinary redirection, while
|
|
the sorted list is written by piping through the @command{sort} utility.
|
|
|
|
The next example uses redirection to mail a message to the mailing
|
|
list @samp{bug-system}. This might be useful when trouble is encountered
|
|
in an @command{awk} script run periodically for system maintenance:
|
|
|
|
@example
|
|
report = "mail bug-system"
|
|
print "Awk script failed:", $0 | report
|
|
m = ("at record number " FNR " of " FILENAME)
|
|
print m | report
|
|
close(report)
|
|
@end example
|
|
|
|
The message is built using string concatenation and saved in the variable
|
|
@code{m}. It's then sent down the pipeline to the @command{mail} program.
|
|
(The parentheses group the items to concatenate---see
|
|
@ref{Concatenation}.)
|
|
|
|
The @code{close} function is called here because it's a good idea to close
|
|
the pipe as soon as all the intended output has been sent to it.
|
|
@xref{Close Files And Pipes},
|
|
for more information.
|
|
|
|
This example also illustrates the use of a variable to represent
|
|
a @var{file} or @var{command}---it is not necessary to always
|
|
use a string constant. Using a variable is generally a good idea,
|
|
because @command{awk} requires that the string value be spelled identically
|
|
every time.
|
|
|
|
@cindex coprocesses
|
|
@cindex @code{|} (vertical bar), @code{|&} operator (I/O)
|
|
@cindex operators, input/output
|
|
@cindex differences in @command{awk} and @command{gawk}, input/output operators
|
|
@item print @var{items} |& @var{command}
|
|
This type of redirection prints the items to the input of @var{command}.
|
|
The difference between this and the
|
|
single-@samp{|} redirection is that the output from @var{command}
|
|
can be read with @code{getline}.
|
|
Thus, @var{command} is a @dfn{coprocess}, which works together with,
but is subsidiary to, the @command{awk} program.
|
|
|
|
This feature is a @command{gawk} extension, and is not available in
|
|
POSIX @command{awk}.
|
|
@xref{Two-way I/O},
|
|
for a more complete discussion.
|
|
@end table
|
|
|
|
Redirecting output using @samp{>}, @samp{>>}, @samp{|}, or @samp{|&}
|
|
asks the system to open a file, pipe, or coprocess only if the particular
|
|
@var{file} or @var{command} you specify has not already been written
|
|
to by your program or if it has been closed since it was last written to.
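Because a file is opened only on its first use in a given run, the
difference between @samp{>} and @samp{>>} shows up across separate runs of
@command{awk}. The following sketch assumes a POSIX shell with
@command{mktemp} available; the scratch file and the shell variable names
are hypothetical.

```shell
log=$(mktemp)                     # hypothetical scratch file

awk -v out="$log" 'BEGIN { print "run 1" > out }'
awk -v out="$log" 'BEGIN { print "run 2" >> out }'   # keeps "run 1"
after_append=$(cat "$log")

awk -v out="$log" 'BEGIN { print "run 3" > out }'    # erases both lines
after_truncate=$(cat "$log")

rm -f "$log"
```

After the second run the file holds two lines; after the third run it
holds only @samp{run 3}.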
|
|
|
|
@cindex troubleshooting, printing
|
|
It is a common error to use @samp{>} redirection for the first @code{print}
|
|
to a file, and then to use @samp{>>} for subsequent output:
|
|
|
|
@example
|
|
# clear the file
|
|
print "Don't panic" > "guide.txt"
|
|
@dots{}
|
|
# append
|
|
print "Avoid improbability generators" >> "guide.txt"
|
|
@end example
|
|
|
|
@noindent
|
|
This is indeed how redirections must be used from the shell. But in
|
|
@command{awk}, it isn't necessary. In this kind of case, a program should
|
|
use @samp{>} for all the @code{print} statements, since the output file
|
|
is only opened once.
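A small experiment makes this concrete. In the sketch below, which
assumes a POSIX shell with @command{mktemp}, the file already has
contents before @command{awk} starts; the shell variable names are
hypothetical.

```shell
guide=$(mktemp)                   # hypothetical scratch file
echo "stale contents" > "$guide"

awk -v out="$guide" 'BEGIN {
    print "first line"  > out     # opens out, discarding "stale contents"
    print "second line" > out     # same open file: added after the first line
}'
guide_contents=$(cat "$guide")
rm -f "$guide"
printf '%s\n' "$guide_contents"
```

The old contents are erased once, when the file is first opened, and both
lines written with @samp{>} survive.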
|
|
|
|
@cindex differences in @command{awk} and @command{gawk}, implementation limitations
|
|
@c the comma here does NOT start a secondary
|
|
@cindex implementation issues, @command{gawk}, limits
|
|
@cindex @command{awk}, implementation issues, pipes
|
|
@cindex @command{gawk}, implementation issues, pipes
|
|
@ifnotinfo
|
|
As mentioned earlier
|
|
(@pxref{Getline Notes}),
|
|
many
|
|
@end ifnotinfo
|
|
@ifnottex
|
|
Many
|
|
@end ifnottex
|
|
@command{awk} implementations limit the number of pipelines that an @command{awk}
|
|
program may have open to just one! In @command{gawk}, there is no such limit.
|
|
@command{gawk} allows a program to
|
|
open as many pipelines as the underlying operating system permits.
|
|
|
|
@c fakenode --- for prepinfo
|
|
@subheading Advanced Notes: Piping into @command{sh}
|
|
@cindex advanced features, piping into @command{sh}
|
|
@cindex shells, piping commands into
|
|
|
|
A particularly powerful way to use redirection is to build command lines
|
|
and pipe them into the shell, @command{sh}. For example, suppose you
|
|
have a list of files brought over from a system where all the @value{FN}s
|
|
are stored in uppercase, and you wish to rename them to have names in
|
|
all lowercase. The following program is both simple and efficient:
|
|
|
|
@c @cindex @command{mv} utility
|
|
@example
@{ printf("mv %s %s\n", $0, tolower($0)) | "sh" @}

END @{ close("sh") @}
@end example
|
|
|
|
The @code{tolower} function returns its argument string with all
|
|
uppercase characters converted to lowercase
|
|
(@pxref{String Functions}).
|
|
The program builds up a list of command lines,
|
|
using the @command{mv} utility to rename the files.
|
|
It then sends the list to the shell for execution.
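When experimenting with this technique, it may be prudent to look at the
generated command lines before letting the shell run them. The sketch
below is such a dry run: it leaves out the pipe into @command{sh}, so
nothing is renamed. The two input names are made up for illustration.

```shell
# Dry run: print the generated commands instead of piping them to sh.
commands=$(printf 'README.TXT\nMAIN.C\n' |
    awk '{ printf("mv %s %s\n", $0, tolower($0)) }')
printf '%s\n' "$commands"
```

Note that this sketch, like the program above, does no quoting, so names
containing spaces or shell metacharacters would need extra care.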
|
|
@c ENDOFRANGE outre
|
|
@c ENDOFRANGE reout
|
|
|
|
@node Special Files
|
|
@section Special @value{FFN}s in @command{gawk}
|
|
@c STARTOFRANGE gfn
|
|
@cindex @command{gawk}, @value{FN}s in
|
|
|
|
@command{gawk} provides a number of special @value{FN}s that it interprets
|
|
internally. These @value{FN}s provide access to standard file descriptors,
|
|
process-related information, and TCP/IP networking.
|
|
|
|
@menu
|
|
* Special FD:: Special files for I/O.
|
|
* Special Process:: Special files for process information.
|
|
* Special Network:: Special files for network communications.
|
|
* Special Caveats:: Things to watch out for.
|
|
@end menu
|
|
|
|
@node Special FD
|
|
@subsection Special Files for Standard Descriptors
|
|
@cindex standard input
|
|
@cindex input, standard
|
|
@cindex standard output
|
|
@cindex output, standard
|
|
@cindex error output
|
|
@cindex file descriptors
|
|
@cindex files, descriptors, See file descriptors
|
|
|
|
Running programs conventionally have three input and output streams
|
|
already available to them for reading and writing. These are known as
|
|
the @dfn{standard input}, @dfn{standard output}, and @dfn{standard error
|
|
output}. These streams are, by default, connected to your terminal, but
|
|
they are often redirected with the shell, via the @samp{<}, @samp{<<},
|
|
@samp{>}, @samp{>>}, @samp{>&}, and @samp{|} operators. Standard error
|
|
is typically used for writing error messages; the reason there are two separate
|
|
streams, standard output and standard error, is so that they can be
|
|
redirected separately.
|
|
|
|
@cindex differences in @command{awk} and @command{gawk}, error messages
|
|
@cindex error handling
|
|
In other implementations of @command{awk}, the only way to write an error
|
|
message to standard error in an @command{awk} program is as follows:
|
|
|
|
@example
|
|
print "Serious error detected!" | "cat 1>&2"
|
|
@end example
|
|
|
|
@noindent
|
|
This works by opening a pipeline to a shell command that can access the
|
|
standard error stream that it inherits from the @command{awk} process.
|
|
This is far from elegant, and it is also inefficient, because it requires a
|
|
separate process. So people writing @command{awk} programs often
|
|
don't do this. Instead, they send the error messages to the
|
|
terminal, like this:
|
|
|
|
@example
|
|
print "Serious error detected!" > "/dev/tty"
|
|
@end example
|
|
|
|
@noindent
|
|
This usually has the same effect but not always: although the
|
|
standard error stream is usually the terminal, it can be redirected; when
|
|
that happens, writing to the terminal is not correct. In fact, if
|
|
@command{awk} is run from a background job, it may not have a terminal at all.
|
|
Then opening @file{/dev/tty} fails.
|
|
|
|
@command{gawk} provides special @value{FN}s for accessing the three standard
|
|
streams, as well as any other inherited open files. If the @value{FN} matches
|
|
one of these special names when @command{gawk} redirects input or output,
|
|
then it directly uses the stream that the @value{FN} stands for.
|
|
These special @value{FN}s work for all operating systems that @command{gawk}
|
|
has been ported to, not just those that are POSIX-compliant:
|
|
|
|
@cindex @value{FN}s, standard streams in @command{gawk}
|
|
@cindex @code{/dev/@dots{}} special files (@command{gawk})
|
|
@cindex files, @code{/dev/@dots{}} special files
|
|
@c @cindex @code{/dev/stdin} special file
|
|
@c @cindex @code{/dev/stdout} special file
|
|
@c @cindex @code{/dev/stderr} special file
|
|
@c @cindex @code{/dev/fd} special files
|
|
@table @file
|
|
@item /dev/stdin
|
|
The standard input (file descriptor 0).
|
|
|
|
@item /dev/stdout
|
|
The standard output (file descriptor 1).
|
|
|
|
@item /dev/stderr
|
|
The standard error output (file descriptor 2).
|
|
|
|
@item /dev/fd/@var{N}
|
|
The file associated with file descriptor @var{N}. Such a file must
|
|
be opened by the program initiating the @command{awk} execution (typically
|
|
the shell). Unless special pains are taken in the shell from which
|
|
@command{gawk} is invoked, only descriptors 0, 1, and 2 are available.
|
|
@end table
|
|
|
|
The @value{FN}s @file{/dev/stdin}, @file{/dev/stdout}, and @file{/dev/stderr}
|
|
are aliases for @file{/dev/fd/0}, @file{/dev/fd/1}, and @file{/dev/fd/2},
|
|
respectively. However, they are more self-explanatory.
|
|
The proper way to write an error message in a @command{gawk} program
|
|
is to use @file{/dev/stderr}, like this:
|
|
|
|
@example
|
|
print "Serious error detected!" > "/dev/stderr"
|
|
@end example
|
|
|
|
@cindex troubleshooting, quotes with @value{FN}s
|
|
Note the use of quotes around the @value{FN}.
|
|
Like any other redirection, the value must be a string.
|
|
It is a common error to omit the quotes, which leads
|
|
to confusing results.
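To see that the two streams really are separate, the following sketch
captures standard output in one shell variable and sends standard error
to a scratch file. It assumes a POSIX shell with @command{mktemp}, and an
@command{awk} that either interprets @file{/dev/stderr} itself, as
@command{gawk} does, or runs on a system that provides that file. The
variable names are hypothetical.

```shell
errfile=$(mktemp)                 # hypothetical scratch file
stdout_text=$(awk 'BEGIN {
    print "normal output"
    print "Serious error detected!" > "/dev/stderr"
}' 2> "$errfile")
stderr_text=$(cat "$errfile")
rm -f "$errfile"
printf 'stdout: %s\nstderr: %s\n' "$stdout_text" "$stderr_text"
```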
|
|
@c Exercise: What does it do? :-)
|
|
|
|
@node Special Process
|
|
@subsection Special Files for Process-Related Information
|
|
|
|
@cindex files, for process information
|
|
@cindex process information, files for
|
|
@command{gawk} also provides special @value{FN}s that give access to information
|
|
about the running @command{gawk} process. Each of these ``files'' provides
|
|
a single record of information. To read them more than once, they must
|
|
first be closed with the @code{close} function
|
|
(@pxref{Close Files And Pipes}).
|
|
The @value{FN}s are:
|
|
|
|
@c @cindex @code{/dev/pid} special file
|
|
@c @cindex @code{/dev/pgrpid} special file
|
|
@c @cindex @code{/dev/ppid} special file
|
|
@c @cindex @code{/dev/user} special file
|
|
@table @file
|
|
@item /dev/pid
|
|
Reading this file returns the process ID of the current process,
|
|
in decimal form, terminated with a newline.
|
|
|
|
@item /dev/ppid
|
|
Reading this file returns the parent process ID of the current process,
|
|
in decimal form, terminated with a newline.
|
|
|
|
@item /dev/pgrpid
|
|
Reading this file returns the process group ID of the current process,
|
|
in decimal form, terminated with a newline.
|
|
|
|
@item /dev/user
|
|
Reading this file returns a single record terminated with a newline.
|
|
The fields are separated with spaces. The fields represent the
|
|
following information:
|
|
|
|
@table @code
|
|
@item $1
|
|
The return value of the @code{getuid} system call
|
|
(the real user ID number).
|
|
|
|
@item $2
|
|
The return value of the @code{geteuid} system call
|
|
(the effective user ID number).
|
|
|
|
@item $3
|
|
The return value of the @code{getgid} system call
|
|
(the real group ID number).
|
|
|
|
@item $4
|
|
The return value of the @code{getegid} system call
|
|
(the effective group ID number).
|
|
@end table
|
|
|
|
If there are any additional fields, they are the group IDs returned by
|
|
the @code{getgroups} system call.
|
|
(Multiple groups may not be supported on all systems.)
|
|
@end table
|
|
|
|
These special @value{FN}s may be used on the command line as @value{DF}s,
|
|
as well as for I/O redirections within an @command{awk} program.
|
|
They may not be used as source files with the @option{-f} option.
|
|
|
|
@c @cindex automatic warnings
|
|
@c @cindex warnings, automatic
|
|
@strong{Note:}
|
|
The special files that provide process-related information are now considered
|
|
obsolete and will disappear entirely
|
|
in the next release of @command{gawk}.
|
|
@command{gawk} prints a warning message every time you use one of
|
|
these files.
|
|
To obtain process-related information, use the @code{PROCINFO} array.
|
|
@xref{Auto-set}.
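As a rough sketch of the replacement, the following prints the values
that @file{/dev/pid}, @file{/dev/ppid}, and the first field of
@file{/dev/user} used to supply, taking them from @code{PROCINFO}
instead. It assumes a POSIX shell and checks for a program named
@command{gawk} on the search path before running it; the shell variable
@code{ids} is a hypothetical name.

```shell
if command -v gawk > /dev/null 2>&1; then
    # Process ID, parent process ID, and real user ID.
    ids=$(gawk 'BEGIN { print PROCINFO["pid"], PROCINFO["ppid"], PROCINFO["uid"] }')
else
    ids="gawk not found"
fi
printf '%s\n' "$ids"
```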
|
|
|
|
@node Special Network
|
|
@subsection Special Files for Network Communications
|
|
@cindex networks, support for
|
|
@cindex TCP/IP, support for
|
|
|
|
Starting with @value{PVERSION} 3.1 of @command{gawk}, @command{awk} programs
|
|
can open a two-way
|
|
TCP/IP connection, acting as either a client or a server.
|
|
This is done using a special @value{FN} of the form:
|
|
|
|
@example
|
|
@file{/inet/@var{protocol}/@var{local-port}/@var{remote-host}/@var{remote-port}}
|
|
@end example
|
|
|
|
The @var{protocol} is one of @samp{tcp}, @samp{udp}, or @samp{raw},
|
|
and the other fields represent the other essential pieces of information
|
|
for making a networking connection.
|
|
These @value{FN}s are used with the @samp{|&} operator for communicating
|
|
with a coprocess
|
|
(@pxref{Two-way I/O}).
|
|
This is an advanced feature, mentioned here only for completeness.
|
|
Full discussion is delayed until
|
|
@ref{TCP/IP Networking}.
|
|
|
|
@node Special Caveats
|
|
@subsection Special @value{FFN} Caveats
|
|
|
|
Here is a list of things to bear in mind when using the
|
|
special @value{FN}s that @command{gawk} provides:
|
|
|
|
@itemize @bullet
|
|
@cindex compatibility mode (@command{gawk}), @value{FN}s
|
|
@cindex @value{FN}s, in compatibility mode
|
|
@item
|
|
Recognition of these special @value{FN}s is disabled if @command{gawk} is in
|
|
compatibility mode (@pxref{Options}).
|
|
|
|
@c @cindex automatic warnings
|
|
@c @cindex warnings, automatic
|
|
@cindex @code{PROCINFO} array
|
|
@item
|
|
@ifnottex
|
|
The
|
|
@end ifnottex
|
|
@ifnotinfo
|
|
As mentioned earlier, the
|
|
@end ifnotinfo
|
|
special files that provide process-related information are now considered
|
|
obsolete and will disappear entirely
|
|
in the next release of @command{gawk}.
|
|
@command{gawk} prints a warning message every time you use one of
|
|
these files.
|
|
@ifnottex
|
|
To obtain process-related information, use the @code{PROCINFO} array.
|
|
@xref{Built-in Variables}.
|
|
@end ifnottex
|
|
|
|
@item
|
|
Starting with @value{PVERSION} 3.1, @command{gawk} @emph{always}
|
|
interprets these special @value{FN}s.@footnote{Older versions of
|
|
@command{gawk} would interpret these names internally only if the system
|
|
did not actually have a @file{/dev/fd} directory or any of the other
|
|
special files listed earlier. Usually this didn't make a difference,
|
|
but sometimes it did; thus, it was decided to make @command{gawk}'s
|
|
behavior consistent on all systems and to have it always interpret
|
|
the special @value{FN}s itself.}
|
|
For example, using @samp{/dev/fd/4}
|
|
for output actually writes on file descriptor 4, and not on a new
|
|
file descriptor that is @code{dup}'ed from file descriptor 4. Most of
|
|
the time this does not matter; however, it is important to @emph{not}
|
|
close any of the files related to file descriptors 0, 1, and 2.
|
|
Doing so results in unpredictable behavior.
|
|
@end itemize
|
|
@c ENDOFRANGE gfn
|
|
|
|
@node Close Files And Pipes
|
|
@section Closing Input and Output Redirections
|
|
@cindex files, output, See output files
|
|
@c STARTOFRANGE ifc
|
|
@cindex input files, closing
|
|
@c comma before closing is NOT start of tertiary
|
|
@c STARTOFRANGE ofc
|
|
@cindex output, files, closing
|
|
@c STARTOFRANGE pc
|
|
@cindex pipes, closing
|
|
@c STARTOFRANGE cc
|
|
@cindex coprocesses, closing
|
|
@c comma before using is NOT start of tertiary
|
|
@cindex @code{getline} command, coprocesses, using from
|
|
|
|
If the same @value{FN} or the same shell command is used with @code{getline}
|
|
more than once during the execution of an @command{awk} program
|
|
(@pxref{Getline}),
|
|
the file is opened (or the command is executed) the first time only.
|
|
At that time, the first record of input is read from that file or command.
|
|
The next time the same file or command is used with @code{getline},
|
|
another record is read from it, and so on.
|
|
|
|
Similarly, when a file or pipe is opened for output, the @value{FN} or
|
|
command associated with it is remembered by @command{awk}, and subsequent
|
|
writes to the same file or command are appended to the previous writes.
|
|
The file or pipe stays open until @command{awk} exits.
|
|
|
|
@cindex @code{close} function
|
|
This implies that special steps are necessary in order to read the same
|
|
file again from the beginning, or to rerun a shell command (rather than
|
|
reading more output from the same command). The @code{close} function
|
|
makes these things possible:
|
|
|
|
@example
|
|
close(@var{filename})
|
|
@end example
|
|
|
|
@noindent
|
|
or:
|
|
|
|
@example
|
|
close(@var{command})
|
|
@end example
|
|
|
|
The argument @var{filename} or @var{command} can be any expression. Its
|
|
value must @emph{exactly} match the string that was used to open the file or
|
|
start the command (spaces and other ``irrelevant'' characters
|
|
included). For example, if you open a pipe with this:
|
|
|
|
@example
|
|
"sort -r names" | getline foo
|
|
@end example
|
|
|
|
@noindent
|
|
then you must close it with this:
|
|
|
|
@example
|
|
close("sort -r names")
|
|
@end example
|
|
|
|
Once this function call is executed, the next @code{getline} from that
|
|
file or command, or the next @code{print} or @code{printf} to that
|
|
file or command, reopens the file or reruns the command.
|
|
Because the expression that you use to close a file or pipeline must
|
|
exactly match the expression used to open the file or run the command,
|
|
it is good practice to use a variable to store the @value{FN} or command.
|
|
The previous example becomes the following:
|
|
|
|
@example
|
|
sortcom = "sort -r names"
|
|
sortcom | getline foo
|
|
@dots{}
|
|
close(sortcom)
|
|
@end example
|
|
|
|
@noindent
|
|
This helps avoid hard-to-find typographical errors in your @command{awk}
|
|
programs. Here are some of the reasons for closing an output file:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
To write a file and read it back later on in the same @command{awk}
|
|
program. Close the file after writing it, then
|
|
begin reading it with @code{getline}.
|
|
|
|
@item
|
|
To write numerous files, successively, in the same @command{awk}
|
|
program. If the files aren't closed, eventually @command{awk} may exceed a
|
|
system limit on the number of open files in one process. It is best to
|
|
close each one when the program has finished writing it.
|
|
|
|
@item
|
|
To make a command finish. When output is redirected through a pipe,
|
|
the command reading the pipe normally continues to try to read input
|
|
as long as the pipe is open. Often this means the command cannot
|
|
really do its work until the pipe is closed. For example, if
|
|
output is redirected to the @command{mail} program, the message is not
|
|
actually sent until the pipe is closed.
|
|
|
|
@item
|
|
To run the same program a second time, with the same arguments.
|
|
This is not the same thing as giving more input to the first run!
|
|
|
|
For example, suppose a program pipes output to the @command{mail} program.
|
|
If it outputs several lines redirected to this pipe without closing
|
|
it, they make a single message of several lines. By contrast, if the
|
|
program closes the pipe after each line of output, then each line makes
|
|
a separate message.
|
|
@end itemize
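The first reason in the list above can be sketched as follows. The
program writes two lines to a scratch file, closes it, and then reads the
lines back with @code{getline}. The sketch assumes a POSIX shell with
@command{mktemp}; the names @code{scratch}, @code{tmp}, and
@code{readback} are hypothetical.

```shell
scratch=$(mktemp)                 # hypothetical scratch file
readback=$(awk -v tmp="$scratch" 'BEGIN {
    print "alpha" > tmp
    print "beta"  > tmp
    close(tmp)                       # flush and close the output side
    while ((getline line < tmp) > 0) # reopens tmp for reading from the start
        print "read:", line
    close(tmp)
}')
rm -f "$scratch"
printf '%s\n' "$readback"
```

Without the first @code{close}, the data might still be sitting in an
output buffer when the program tries to read the file.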
|
|
|
|
@cindex differences in @command{awk} and @command{gawk}, @code{close} function
|
|
@cindex portability, @code{close} function and
|
|
If you use more files than the system allows you to have open,
|
|
@command{gawk} attempts to multiplex the available open files among
|
|
your @value{DF}s. @command{gawk}'s ability to do this depends upon the
|
|
facilities of your operating system, so it may not always work. It is
|
|
therefore both good practice and good portability advice to always
|
|
use @code{close} on your files when you are done with them.
|
|
In fact, if you are using a lot of pipes, it is essential that
|
|
you close commands when done. For example, consider something like this:
|
|
|
|
@example
@{
    @dots{}
    command = ("grep " $1 " /some/file | my_prog -q " $3)
    while ((command | getline) > 0) @{
        @var{process output of} command
    @}
    # need close(command) here
@}
@end example
|
|
|
|
This example creates a new pipeline based on data in @emph{each} record.
|
|
Without the call to @code{close} indicated in the comment, @command{awk}
|
|
creates child processes to run the commands, until it eventually
|
|
runs out of file descriptors for more pipelines.
|
|
|
|
Even though each command has finished (as indicated by the end-of-file
|
|
return status from @code{getline}), the child process is not
|
|
terminated;@footnote{The technical terminology is rather morbid.
|
|
The finished child is called a ``zombie,'' and cleaning up after
|
|
it is referred to as ``reaping.''}
|
|
@c Good old UNIX: give the marketing guys fits, that's the ticket
|
|
more importantly, the file descriptor for the pipe
|
|
is not closed and released until @code{close} is called or
|
|
@command{awk} exits.
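
A minimal sketch of the pattern described above, using @command{echo}
in place of a real pipeline. The input words and the shell variable
@code{out} are hypothetical. The pipe is closed as soon as its output has
been read, so every record starts with a free descriptor.

```shell
# One short-lived pipeline per input record, closed as soon as it is drained.
out=$(printf 'alpha\nbeta\n' | awk '{
    cmd = "echo " $1
    while ((cmd | getline line) > 0)
        print "got", line
    close(cmd)        # release the pipe before the next record
}')
printf '%s\n' "$out"
```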
|
|
|
|
@code{close} will silently do nothing if given an argument that
|
|
does not represent a file, pipe or coprocess that was opened with
|
|
a redirection.
|
|
|
|
Note also that @samp{close(FILENAME)} has no
|
|
``magic'' effects on the implicit loop that reads through the
|
|
files named on the command line. It is, more likely, a close
|
|
of a file that was never opened, so @command{awk} silently
|
|
does nothing.
|
|
|
|
@c comma is part of tertiary
|
|
@cindex @code{|} (vertical bar), @code{|&} operator (I/O), pipes, closing
|
|
When using the @samp{|&} operator to communicate with a coprocess,
|
|
it is occasionally useful to be able to close one end of the two-way
|
|
pipe without closing the other.
|
|
This is done by supplying a second argument to @code{close}.
|
|
As in any other call to @code{close},
|
|
the first argument is the name of the command or special file used
|
|
to start the coprocess.
|
|
The second argument should be a string, with either of the values
|
|
@code{"to"} or @code{"from"}. Case does not matter.
|
|
As this is an advanced feature, a more complete discussion is
|
|
delayed until
|
|
@ref{Two-way I/O},
|
|
which gives an example.
|
|
|
|
@c fakenode --- for prepinfo
|
|
@subheading Advanced Notes: Using @code{close}'s Return Value
|
|
@cindex advanced features, @code{close} function
|
|
@cindex dark corner, @code{close} function
|
|
@cindex @code{close} function, return values
|
|
@c comma does NOT start secondary
|
|
@cindex return values, @code{close} function
|
|
@cindex differences in @command{awk} and @command{gawk}, @code{close} function
|
|
@cindex Unix @command{awk}, @code{close} function and
|
|
|
|
In many versions of Unix @command{awk}, the @code{close} function
|
|
is actually a statement. It is a syntax error to try to use the return
|
|
value from @code{close}:
|
|
@value{DARKCORNER}
|
|
|
|
@example
|
|
command = "@dots{}"
|
|
command | getline info
|
|
retval = close(command) # syntax error in most Unix awks
|
|
@end example
|
|
|
|
@command{gawk} treats @code{close} as a function.
|
|
The return value is @minus{}1 if the argument names something
|
|
that was never opened with a redirection, or if there is
|
|
a system problem closing the file or process.
|
|
In these cases, @command{gawk} sets the built-in variable
|
|
@code{ERRNO} to a string describing the problem.
|
|
|
|
In @command{gawk},
|
|
when closing a pipe or coprocess,
|
|
the return value is the exit status of the command.@footnote{
|
|
This is a full 16-bit value as returned by the @code{wait}
|
|
system call. See the system manual pages for information on
|
|
how to decode this value.}
|
|
Otherwise, it is the return value from the system's @code{close} or
|
|
@code{fclose} C functions when closing input or output
|
|
files, respectively.
|
|
This value is zero if the close succeeds, or @minus{}1 if
|
|
it fails.
|
|
|
|
The POSIX standard is very vague; it says that @code{close}
|
|
returns zero on success and non-zero otherwise. In general,
|
|
different implementations vary in what they report when closing
|
|
pipes; thus the return value cannot be used portably.
|
|
@value{DARKCORNER}
|
|
|
|
@ignore
|
|
@c 4/27/2003: Commenting this out for now, given the above
|
|
@c return of 16-bit value
|
|
The return value for closing a pipeline is particularly useful.
|
|
It allows you to get the output from a command as well as its
|
|
exit status.
|
|
@c 8/21/2002, FIXME: Maybe the code and this doc should be adjusted to
|
|
@c create values indicating death-by-signal? Sigh.
|
|
|
|
@cindex pipes, closing
|
|
@c comma does NOT start tertiary
|
|
@cindex POSIX @command{awk}, pipes, closing
|
|
For POSIX-compliant systems,
|
|
if the exit status is a number above 128, then the program
|
|
was terminated by a signal. Subtract 128 to get the signal number:
|
|
|
|
@example
|
|
exit_val = close(command)
|
|
if (exit_val > 128)
|
|
print command, "died with signal", exit_val - 128
|
|
else
|
|
print command, "exited with code", exit_val
|
|
@end example
|
|
|
|
Currently, in @command{gawk}, this only works for commands
|
|
piping into @code{getline}. For commands piped into
|
|
from @code{print} or @code{printf}, the
|
|
return value from @code{close} is that of the library's
|
|
@code{pclose} function.
|
|
@end ignore
|
|
@c ENDOFRANGE ifc
|
|
@c ENDOFRANGE ofc
|
|
@c ENDOFRANGE pc
|
|
@c ENDOFRANGE cc
|
|
@c ENDOFRANGE prnt
|
|
|
|
@node Expressions
|
|
@chapter Expressions
|
|
@c STARTOFRANGE exps
|
|
@cindex expressions
|
|
|
|
Expressions are the basic building blocks of @command{awk} patterns
|
|
and actions. An expression evaluates to a value that you can print, test,
|
|
or pass to a function. Additionally, an expression
|
|
can assign a new value to a variable or a field by using an assignment operator.
|
|
|
|
An expression can serve as a pattern or action statement on its own.
|
|
Most other kinds of
|
|
statements contain one or more expressions that specify the data on which to
|
|
operate. As in other languages, expressions in @command{awk} include
|
|
variables, array references, constants, and function calls, as well as
|
|
combinations of these with various operators.
|
|
|
|
@menu
|
|
* Constants:: String, numeric and regexp constants.
|
|
* Using Constant Regexps:: When and how to use a regexp constant.
|
|
* Variables:: Variables give names to values for later use.
|
|
* Conversion:: The conversion of strings to numbers and vice
|
|
versa.
|
|
* Arithmetic Ops:: Arithmetic operations (@samp{+}, @samp{-},
|
|
etc.)
|
|
* Concatenation:: Concatenating strings.
|
|
* Assignment Ops:: Changing the value of a variable or a field.
|
|
* Increment Ops:: Incrementing the numeric value of a variable.
|
|
* Truth Values:: What is ``true'' and what is ``false''.
|
|
* Typing and Comparison:: How variables acquire types and how this
|
|
affects comparison of numbers and strings with
|
|
@samp{<}, etc.
|
|
* Boolean Ops:: Combining comparison expressions using boolean
|
|
operators @samp{||} (``or''), @samp{&&}
|
|
(``and'') and @samp{!} (``not'').
|
|
* Conditional Exp:: Conditional expressions select between two
|
|
subexpressions under control of a third
|
|
subexpression.
|
|
* Function Calls:: A function call is an expression.
|
|
* Precedence:: How various operators nest.
|
|
@end menu
|
|
|
|
@node Constants
|
|
@section Constant Expressions
|
|
@cindex constants, types of
|
|
|
|
The simplest type of expression is the @dfn{constant}, which always has
|
|
the same value. There are three types of constants: numeric,
|
|
string, and regular expression.
|
|
|
|
Each is used in the appropriate context when you need a data
|
|
value that isn't going to change. Numeric constants can
|
|
have different forms, but are stored identically internally.
|
|
|
|
@menu
|
|
* Scalar Constants:: Numeric and string constants.
|
|
* Nondecimal-numbers:: What are octal and hex numbers.
|
|
* Regexp Constants:: Regular Expression constants.
|
|
@end menu
|
|
|
|
@node Scalar Constants
|
|
@subsection Numeric and String Constants
|
|
|
|
@cindex numeric, constants
|
|
A @dfn{numeric constant} stands for a number. This number can be an
|
|
integer, a decimal fraction, or a number in scientific (exponential)
|
|
notation.@footnote{The internal representation of all numbers,
|
|
including integers, uses double-precision
|
|
floating-point numbers.
|
|
On most modern systems, these are in IEEE 754 standard format.}
|
|
Here are some examples of numeric constants that all
|
|
have the same value:
|
|
|
|
@example
|
|
105
|
|
1.05e+2
|
|
1050e-1
|
|
@end example
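
As a quick check, the three spellings can be compared directly. This is
an illustrative sketch; the shell variable @code{out} and the messages
are invented for the example.

```shell
# The three constants are different spellings of the same number.
out=$(awk 'BEGIN {
    if (105 == 1.05e+2 && 105 == 1050e-1)
        print "same value"
    else
        print "different"
}')
printf '%s\n' "$out"
```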
|
|
|
|
@cindex string constants
|
|
A string constant consists of a sequence of characters enclosed in
|
|
double-quotation marks. For example:
|
|
|
|
@example
|
|
"parrot"
|
|
@end example
|
|
|
|
@noindent
|
|
@cindex differences in @command{awk} and @command{gawk}, strings
|
|
@cindex strings, length of
|
|
represents the string whose contents are @samp{parrot}. Strings in
|
|
@command{gawk} can be of any length, and they can contain any of the possible
|
|
eight-bit characters, including ASCII @sc{nul} (character code zero).
|
|
Other @command{awk}
|
|
implementations may have difficulty with some character codes.
|
|
|
|
@node Nondecimal-numbers
|
|
@subsection Octal and Hexadecimal Numbers
|
|
@cindex octal numbers
|
|
@cindex hexadecimal numbers
|
|
@cindex numbers, octal
|
|
@cindex numbers, hexadecimal
|
|
|
|
In @command{awk}, all numbers are in decimal; i.e., base 10. Many other
|
|
programming languages allow you to specify numbers in other bases, often
|
|
octal (base 8) and hexadecimal (base 16).
|
|
In octal, the numbers go 0, 1, 2, 3, 4, 5, 6, 7, 10, 11, 12, etc.
|
|
Just as @samp{11}, in decimal, is 1 times 10 plus 1, so
|
|
@samp{11}, in octal, is 1 times 8, plus 1. This equals 9 in decimal.
|
|
In hexadecimal, there are 16 digits. Since the everyday decimal
|
|
number system only has ten digits (@samp{0}--@samp{9}), the letters
|
|
@samp{a} through @samp{f} are used to represent the rest.
|
|
(Case in the letters is usually irrelevant; hexadecimal @samp{a} and @samp{A}
|
|
have the same value.)
|
|
Thus, @samp{11}, in
|
|
hexadecimal, is 1 times 16 plus 1, which equals 17 in decimal.
|
|
|
|
Just by looking at plain @samp{11}, you can't tell what base it's in.
|
|
So, in C, C++, and other languages derived from C,
|
|
@c such as PERL, but we won't mention that....
|
|
there is a special notation to help signify the base.
|
|
Octal numbers start with a leading @samp{0},
|
|
and hexadecimal numbers start with a leading @samp{0x} or @samp{0X}:
|
|
|
|
@table @code
|
|
@item 11
|
|
Decimal value 11.
|
|
|
|
@item 011
|
|
Octal 11, decimal value 9.
|
|
|
|
@item 0x11
|
|
Hexadecimal 11, decimal value 17.
|
|
@end table
|
|
|
|
This example shows the difference:
|
|
|
|
@example
|
|
$ gawk 'BEGIN @{ printf "%d, %d, %d\n", 011, 11, 0x11 @}'
|
|
@print{} 9, 11, 17
|
|
@end example
|
|
|
|
Being able to use octal and hexadecimal constants in your programs is most
|
|
useful when working with data that cannot be represented conveniently as
|
|
characters or as regular numbers, such as binary data of various sorts.
|
|
|
|
@cindex @command{gawk}, octal numbers and
|
|
@cindex @command{gawk}, hexadecimal numbers and
|
|
@command{gawk} allows the use of octal and hexadecimal
|
|
constants in your program text. However, such numbers in the input data
|
|
are not treated differently; doing so by default would break old
|
|
programs.
|
|
(If you really need to do this, use the @option{--non-decimal-data}
|
|
command-line option;
|
|
@pxref{Nondecimal Data}.)
|
|
If you have octal or hexadecimal data,
|
|
you can use the @code{strtonum} function
|
|
(@pxref{String Functions})
|
|
to convert the data into a number.
|
|
Most of the time, you will want to use octal or hexadecimal constants
|
|
when working with the built-in bit manipulation functions;
|
|
see @ref{Bitwise Functions},
|
|
for more information.
|
|
|
|
Unlike some early C implementations, @samp{8} and @samp{9} are not valid
|
|
in octal constants; e.g., @command{gawk} treats @samp{018} as decimal 18:
|
|
|
|
@example
|
|
$ gawk 'BEGIN @{ print "021 is", 021 ; print 018 @}'
|
|
@print{} 021 is 17
|
|
@print{} 18
|
|
@end example
|
|
|
|
@cindex compatibility mode (@command{gawk}), octal numbers
|
|
@cindex compatibility mode (@command{gawk}), hexadecimal numbers
|
|
Octal and hexadecimal source code constants are a @command{gawk} extension.
|
|
If @command{gawk} is in compatibility mode
|
|
(@pxref{Options}),
|
|
they are not available.
|
|
|
|
@c fakenode --- for prepinfo
|
|
@subheading Advanced Notes: A Constant's Base Does Not Affect Its Value
|
|
@c comma before values does NOT start tertiary
|
|
@cindex advanced features, constants, values of
|
|
|
|
Once a numeric constant has
|
|
been converted internally into a number,
|
|
@command{gawk} no longer remembers
|
|
what the original form of the constant was; the internal value is
|
|
always used. This has particular consequences for conversion of
|
|
numbers to strings:
|
|
|
|
@example
|
|
$ gawk 'BEGIN @{ printf "0x11 is <%s>\n", 0x11 @}'
|
|
@print{} 0x11 is <17>
|
|
@end example
|
|
|
|
@node Regexp Constants
|
|
@subsection Regular Expression Constants
|
|
|
|
@c STARTOFRANGE rec
|
|
@cindex regexp constants
|
|
@cindex @code{~} (tilde), @code{~} operator
|
|
@cindex tilde (@code{~}), @code{~} operator
|
|
@cindex @code{!} (exclamation point), @code{!~} operator
|
|
@cindex exclamation point (@code{!}), @code{!~} operator
|
|
A regexp constant is a regular expression description enclosed in
|
|
slashes, such as @code{@w{/^beginning and end$/}}. Most regexps used in
|
|
@command{awk} programs are constant, but the @samp{~} and @samp{!~}
|
|
matching operators can also match computed or ``dynamic'' regexps
|
|
(which are just ordinary strings or variables that contain a regexp).
|
|
@c ENDOFRANGE cnst
|
|
|
|
@node Using Constant Regexps
|
|
@section Using Regular Expression Constants
|
|
|
|
@cindex dark corner, regexp constants
|
|
When used on the right-hand side of the @samp{~} or @samp{!~}
|
|
operators, a regexp constant merely stands for the regexp that is to be
|
|
matched.
|
|
However, regexp constants (such as @code{/foo/}) may be used like simple expressions.
|
|
When a
|
|
regexp constant appears by itself, it has the same meaning as if it appeared
|
|
in a pattern, i.e., @samp{($0 ~ /foo/)}
|
|
@value{DARKCORNER}
|
|
@xref{Expression Patterns}.
|
|
This means that the following two code segments:
|
|
|
|
@example
|
|
if ($0 ~ /barfly/ || $0 ~ /camelot/)
|
|
print "found"
|
|
@end example
|
|
|
|
@noindent
|
|
and:
|
|
|
|
@example
|
|
if (/barfly/ || /camelot/)
|
|
print "found"
|
|
@end example
|
|
|
|
@noindent
|
|
are exactly equivalent.
|
|
One rather bizarre consequence of this rule is that the following
|
|
Boolean expression is valid, but does not do what the user probably
|
|
intended:
|
|
|
|
@example
|
|
# note that /foo/ is on the left of the ~
|
|
if (/foo/ ~ $1) print "found foo"
|
|
@end example
|
|
|
|
@c @cindex automatic warnings
|
|
@c @cindex warnings, automatic
|
|
@cindex @command{gawk}, regexp constants and
|
|
@cindex regexp constants, in @command{gawk}
|
|
@noindent
|
|
This code is ``obviously'' testing @code{$1} for a match against the regexp
|
|
@code{/foo/}. But in fact, the expression @samp{/foo/ ~ $1} actually means
|
|
@samp{($0 ~ /foo/) ~ $1}. In other words, first match the input record
|
|
against the regexp @code{/foo/}. The result is either zero or one,
|
|
depending upon the success or failure of the match. That result
|
|
is then matched against the first field in the record.
|
|
Because it is unlikely that you would ever really want to make this kind of
|
|
test, @command{gawk} issues a warning when it sees this construct in
|
|
a program.
|
|
Another consequence of this rule is that the assignment statement:
|
|
|
|
@example
|
|
matches = /foo/
|
|
@end example
|
|
|
|
@noindent
|
|
assigns either zero or one to the variable @code{matches}, depending
|
|
upon the contents of the current input record.
|
|
This feature of the language was never well documented until the
|
|
POSIX specification.
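
The effect of such an assignment can be seen with two made-up input
records. This is a small sketch; the sample data and the shell variable
@code{out} are assumptions for illustration only.

```shell
# A bare regexp constant is shorthand for ($0 ~ /foo/), so it yields 1 or 0.
out=$(printf 'foo bar\nbaz\n' | awk '{ matches = /foo/; print $0 ": " matches }')
printf '%s\n' "$out"
```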
|
|
|
|
@cindex differences in @command{awk} and @command{gawk}, regexp constants
|
|
@cindex dark corner, regexp constants, as arguments to user-defined functions
|
|
@cindex @code{gensub} function (@command{gawk})
|
|
@cindex @code{sub} function
|
|
@cindex @code{gsub} function
|
|
Constant regular expressions are also used as the first argument for
|
|
the @code{gensub}, @code{sub}, and @code{gsub} functions, and as the
|
|
second argument of the @code{match} function
|
|
(@pxref{String Functions}).
|
|
Modern implementations of @command{awk}, including @command{gawk}, allow
|
|
the third argument of @code{split} to be a regexp constant, but some
|
|
older implementations do not.
|
|
@value{DARKCORNER}
|
|
This can lead to confusion when attempting to use regexp constants
|
|
as arguments to user-defined functions
|
|
(@pxref{User-defined}).
|
|
For example:
|
|
|
|
@example
|
|
function mysub(pat, repl, str, global)
|
|
@{
|
|
if (global)
|
|
gsub(pat, repl, str)
|
|
else
|
|
sub(pat, repl, str)
|
|
return str
|
|
@}
|
|
|
|
@{
|
|
@dots{}
|
|
text = "hi! hi yourself!"
|
|
mysub(/hi/, "howdy", text, 1)
|
|
@dots{}
|
|
@}
|
|
@end example
|
|
|
|
@c @cindex automatic warnings
|
|
@c @cindex warnings, automatic
|
|
In this example, the programmer wants to pass a regexp constant to the
|
|
user-defined function @code{mysub}, which in turn passes it on to
|
|
either @code{sub} or @code{gsub}. However, what really happens is that
|
|
the @code{pat} parameter is either one or zero, depending upon whether
|
|
or not @code{$0} matches @code{/hi/}.
|
|
@command{gawk} issues a warning when it sees a regexp constant used as
|
|
a parameter to a user-defined function, since passing a truth value in
|
|
this way is probably not what was intended.
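
One way around the problem, sketched here under the assumption that a
dynamic regexp is acceptable, is to pass the pattern as a string. The
function is the same @code{mysub} as above; only the call changes, and
the shell variable @code{out} is hypothetical.

```shell
# Pass the pattern as a string so the function receives the regexp text,
# not the 0-or-1 result of matching $0.
out=$(awk '
function mysub(pat, repl, str, global)
{
    if (global)
        gsub(pat, repl, str)
    else
        sub(pat, repl, str)
    return str
}

BEGIN {
    text = "hi! hi yourself!"
    print mysub("hi", "howdy", text, 1)
}')
printf '%s\n' "$out"
```

Note that the function works on its own copy of @code{str}, so the
caller has to use the return value.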
|
|
@c ENDOFRANGE rec
|
|
|
|
@node Variables
|
|
@section Variables
|
|
|
|
@cindex variables, user-defined
|
|
@cindex user-defined, variables
|
|
Variables are ways of storing values at one point in your program for
|
|
use later in another part of your program. They can be manipulated
|
|
entirely within the program text, and they can also be assigned values
|
|
on the @command{awk} command line.
|
|
|
|
@menu
|
|
* Using Variables:: Using variables in your programs.
|
|
* Assignment Options:: Setting variables on the command-line and a
|
|
summary of command-line syntax. This is an
|
|
advanced method of input.
|
|
@end menu
|
|
|
|
@node Using Variables
|
|
@subsection Using Variables in a Program
|
|
|
|
Variables let you give names to values and refer to them later. Variables
|
|
have already been used in many of the examples. The name of a variable
|
|
must be a sequence of letters, digits, or underscores, and it may not begin
|
|
with a digit. Case is significant in variable names; @code{a} and @code{A}
|
|
are distinct variables.
|
|
|
|
A variable name is a valid expression by itself; it represents the
|
|
variable's current value. Variables are given new values with
|
|
@dfn{assignment operators}, @dfn{increment operators}, and
|
|
@dfn{decrement operators}.
|
|
@xref{Assignment Ops}.
|
|
@c NEXT ED: Can also be changed by sub, gsub, split
|
|
|
|
@cindex variables, built-in
|
|
@cindex variables, initializing
|
|
A few variables have special built-in meanings, such as @code{FS} (the
|
|
field separator), and @code{NF} (the number of fields in the current input
|
|
record). @xref{Built-in Variables}, for a list of the built-in variables.
|
|
These built-in variables can be used and assigned just like all other
|
|
variables, but their values are also used or changed automatically by
|
|
@command{awk}. All built-in variables' names are entirely uppercase.
|
|
|
|
Variables in @command{awk} can be assigned either numeric or string values.
|
|
The kind of value a variable holds can change over the life of a program.
|
|
By default, variables are initialized to the empty string, which
|
|
is zero if converted to a number. There is no need to
|
|
``initialize'' each variable explicitly in @command{awk},
|
|
as you would in C and in most other traditional languages.
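
A short sketch of the default value, using a variable @code{x} that is
never assigned. The names @code{x} and @code{out} are chosen only for
this illustration.

```shell
# An unset variable acts as "" in string context and as 0 in numeric context.
out=$(awk 'BEGIN { print "[" x "]", x + 0, length(x) }')
printf '%s\n' "$out"
```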
|
|
|
|
@node Assignment Options
|
|
@subsection Assigning Variables on the Command Line
|
|
@cindex variables, assigning on command line
|
|
@c comma before assigning does NOT start tertiary
|
|
@cindex command line, variables, assigning on
|
|
|
|
Any @command{awk} variable can be set by including a @dfn{variable assignment}
|
|
among the arguments on the command line when @command{awk} is invoked
|
|
(@pxref{Other Arguments}).
|
|
Such an assignment has the following form:
|
|
|
|
@example
|
|
@var{variable}=@var{text}
|
|
@end example
|
|
|
|
@c comma before assigning does NOT start tertiary
|
|
@cindex @code{-v} option, variables, assigning
|
|
@noindent
|
|
With it, a variable is set either at the beginning of the
|
|
@command{awk} run or in between input files.
|
|
When the assignment is preceded with the @option{-v} option,
|
|
as in the following:
|
|
|
|
@example
|
|
-v @var{variable}=@var{text}
|
|
@end example
|
|
|
|
@noindent
|
|
the variable is set at the very beginning, even before the
|
|
@code{BEGIN} rules are run. The @option{-v} option and its assignment
|
|
must precede all the @value{FN} arguments, as well as the program text.
|
|
(@xref{Options}, for more information about
|
|
the @option{-v} option.)
|
|
Otherwise, the variable assignment is performed at a time determined by
|
|
its position among the input file arguments---after the processing of the
|
|
preceding input file argument. For example:
|
|
|
|
@example
|
|
awk '@{ print $n @}' n=4 inventory-shipped n=2 BBS-list
|
|
@end example
|
|
|
|
@noindent
|
|
prints the value of field number @code{n} for all input records. Before
|
|
the first file is read, the command line sets the variable @code{n}
|
|
equal to four. This causes the fourth field to be printed in lines from
|
|
the file @file{inventory-shipped}. After the first file has finished,
|
|
but before the second file is started, @code{n} is set to two, so that the
|
|
second field is printed in lines from @file{BBS-list}:
|
|
|
|
@example
|
|
$ awk '@{ print $n @}' n=4 inventory-shipped n=2 BBS-list
|
|
@print{} 15
|
|
@print{} 24
|
|
@dots{}
|
|
@print{} 555-5553
|
|
@print{} 555-3412
|
|
@dots{}
|
|
@end example
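
The timing of the two kinds of assignment can be checked with a
throwaway file. The following is a sketch only: the temporary file, its
single line of data, and the shell variables @code{tmp}, @code{out}, and
@code{early} are all invented for the example.

```shell
# A plain assignment takes effect when awk reaches it in the argument list;
# an assignment given with -v is already in place when BEGIN runs.
tmp=$(mktemp)
printf 'a b c\n' > "$tmp"

out=$(awk 'BEGIN { print "in BEGIN: [" n "]" } { print $n }' n=2 "$tmp" n=3 "$tmp")
early=$(awk -v n=5 'BEGIN { print "in BEGIN: [" n "]" }')

rm -f "$tmp"
printf '%s\n%s\n' "$out" "$early"
```

In the first command @code{n} is still unset inside @code{BEGIN}, then
selects the second field for the first reading of the file and the third
field for the second reading.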
|
|
|
|
@cindex dark corner, command-line arguments
|
|
Command-line arguments are made available for explicit examination by
|
|
the @command{awk} program in the @code{ARGV} array
|
|
(@pxref{ARGC and ARGV}).
|
|
@command{awk} processes the values of command-line assignments for escape
|
|
sequences
|
|
(@pxref{Escape Sequences}).
|
|
@value{DARKCORNER}
|
|
|
|
@node Conversion
|
|
@section Conversion of Strings and Numbers
|
|
|
|
@cindex converting, strings to numbers
|
|
@cindex strings, converting
|
|
@cindex numbers, converting
|
|
@cindex converting, numbers
|
|
Strings are converted to numbers and numbers are converted to strings, if the context
|
|
of the @command{awk} program demands it. For example, if the value of
|
|
either @code{foo} or @code{bar} in the expression @samp{foo + bar}
|
|
happens to be a string, it is converted to a number before the addition
|
|
is performed. If numeric values appear in string concatenation, they
|
|
are converted to strings. Consider the following:
|
|
|
|
@example
|
|
two = 2; three = 3
|
|
print (two three) + 4
|
|
@end example
|
|
|
|
@noindent
|
|
This prints the (numeric) value 27. The numeric values of
|
|
the variables @code{two} and @code{three} are converted to strings and
|
|
concatenated together. The resulting string is converted back to the
|
|
number 23, to which 4 is then added.
|
|
|
|
@cindex null strings, converting numbers to strings
|
|
@cindex type conversion
|
|
If, for some reason, you need to force a number to be converted to a
|
|
string, concatenate the empty string, @code{""}, with that number.
|
|
To force a string to be converted to a number, add zero to that string.
|
|
A string is converted to a number by interpreting any numeric prefix
|
|
of the string as numerals:
|
|
@code{"2.5"} converts to 2.5, @code{"1e3"} converts to 1000, and @code{"25fix"}
|
|
has a numeric value of 25.
|
|
Strings that can't be interpreted as valid numbers convert to zero.
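
These conversion rules can be tried directly by adding zero to each
string. This is a minimal sketch; @code{"abc"} is an extra sample string
added here to show the zero case, and @code{out} is a hypothetical shell
variable.

```shell
# Adding zero forces each string to be read as a number.
out=$(awk 'BEGIN { print "2.5" + 0, "1e3" + 0, "25fix" + 0, "abc" + 0 }')
printf '%s\n' "$out"
```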
|
|
|
|
@cindex @code{CONVFMT} variable
|
|
The exact manner in which numbers are converted into strings is controlled
|
|
by the @command{awk} built-in variable @code{CONVFMT} (@pxref{Built-in Variables}).
|
|
Numbers are converted using the @code{sprintf} function
|
|
with @code{CONVFMT} as the format
|
|
specifier
|
|
(@pxref{String Functions}).
|
|
|
|
@code{CONVFMT}'s default value is @code{"%.6g"}, which prints a value with
|
|
at most six significant digits. For some applications, you might want to
|
|
change it to specify more precision.
|
|
On most modern machines,
|
|
17 digits is enough to capture a floating-point number's
|
|
value exactly,
|
|
most of the time.@footnote{Pathological cases can require up to
|
|
752 digits (!), but we doubt that you need to worry about this.}
|
|
|
|
@cindex dark corner, @code{CONVFMT} variable
|
|
Strange results can occur if you set @code{CONVFMT} to a string that doesn't
|
|
tell @code{sprintf} how to format floating-point numbers in a useful way.
|
|
For example, if you forget the @samp{%} in the format, @command{awk} converts
|
|
all numbers to the same constant string.
|
|
As a special case, if a number is an integer, then the result of converting
|
|
it to a string is @emph{always} an integer, no matter what the value of
|
|
@code{CONVFMT} may be. Given the following code fragment:
|
|
|
|
@example
|
|
CONVFMT = "%2.2f"
|
|
a = 12
|
|
b = a ""
|
|
@end example
|
|
|
|
@noindent
|
|
@code{b} has the value @code{"12"}, not @code{"12.00"}.
|
|
@value{DARKCORNER}
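
For contrast, a value with a fractional part does go through
@code{CONVFMT}. The sketch below extends the fragment above with a
second, made-up variable pair @code{x} and @code{y}; the shell variable
@code{out} is likewise just for illustration.

```shell
# Integers bypass CONVFMT; values with a fractional part are formatted by it.
out=$(awk 'BEGIN {
    CONVFMT = "%2.2f"
    a = 12;   b = a ""
    x = 12.5; y = x ""
    print b, y
}')
printf '%s\n' "$out"
```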
|
|
|
|
@cindex POSIX @command{awk}, @code{OFMT} variable and
|
|
@cindex @code{OFMT} variable
|
|
@cindex portability, new @command{awk} vs. old @command{awk}
|
|
@cindex @command{awk}, new vs. old, @code{OFMT} variable
|
|
Prior to the POSIX standard, @command{awk} used the value
|
|
of @code{OFMT} for converting numbers to strings. @code{OFMT}
|
|
specifies the output format to use when printing numbers with @code{print}.
|
|
@code{CONVFMT} was introduced in order to separate the semantics of
|
|
conversion from the semantics of printing. Both @code{CONVFMT} and
|
|
@code{OFMT} have the same default value: @code{"%.6g"}. In the vast majority
|
|
of cases, old @command{awk} programs do not change their behavior.
|
|
However, these semantics for @code{OFMT} are something to keep in mind if you must
|
|
port your new style program to older implementations of @command{awk}.
|
|
We recommend
|
|
that, instead of changing your programs, you just port @command{gawk} itself.
|
|
@xref{Print},
|
|
for more information on the @code{print} statement.
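
The separation of the two variables can be sketched by giving them
different values. This assumes a POSIX-style @command{awk}; the formats,
the sample number, and the shell variable @code{out} are chosen only for
the example.

```shell
# print formats numbers with OFMT; string conversion uses CONVFMT.
out=$(awk 'BEGIN {
    OFMT = "%.2f"; CONVFMT = "%.4f"
    x = 3.14159265
    print x          # number printed directly: OFMT applies
    print x ""       # number converted to a string first: CONVFMT applies
}')
printf '%s\n' "$out"
```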
|
|
|
|
Finally, once again, where you are can matter when it comes to
|
|
converting between numbers and strings. In
|
|
@ref{Locales}, we mentioned that the
|
|
local character set and language (the locale) can affect how @command{gawk} matches
|
|
characters. The locale also affects numeric formats. In particular, for @command{awk}
|
|
programs, it affects the decimal point character. The @code{"C"} locale, and most
|
|
English-language locales, use the period character (@samp{.}) as the decimal point.
|
|
However, many (if not most) European and non-English locales use the comma (@samp{,})
|
|
as the decimal point character.
|
|
|
|
The POSIX standard says that @command{awk} always uses the period as the decimal
|
|
point when reading the @command{awk} program source code, and for command-line
|
|
variable assignments (@pxref{Other Arguments}).
|
|
However, when interpreting input data, for @code{print} and @code{printf} output,
|
|
and for number to string conversion, the local decimal point character is used.
|
|
As of @value{PVERSION} 3.1.3, @command{gawk} fully complies with this aspect
|
|
of the standard. Here are some examples indicating the difference in behavior,
|
|
on a GNU/Linux system:
|
|
|
|
@example
|
|
$ gawk 'BEGIN @{ printf "%g\n", 3.1415927 @}'
|
|
@print{} 3.14159
|
|
$ LC_ALL=en_DK gawk 'BEGIN @{ printf "%g\n", 3.1415927 @}'
|
|
@print{} 3,14159
|
|
$ echo 4,321 | gawk '@{ print $1 + 1 @}'
|
|
@print{} 5
|
|
$ echo 4,321 | LC_ALL=en_DK gawk '@{ print $1 + 1 @}'
|
|
@print{} 5,321
|
|
@end example
|
|
|
|
@noindent
|
|
The @samp{en_DK} locale is for English in Denmark, where the comma acts as
|
|
the decimal point separator. In the normal @code{"C"} locale, @command{gawk}
|
|
treats @samp{4,321} as @samp{4}, while in the Danish locale, it's treated
|
|
as the full number, @samp{4.321}.
|
|
|
|
@node Arithmetic Ops
|
|
@section Arithmetic Operators
|
|
@cindex arithmetic operators
|
|
@cindex operators, arithmetic
|
|
@c @cindex addition
|
|
@c @cindex subtraction
|
|
@c @cindex multiplication
|
|
@c @cindex division
|
|
@c @cindex remainder
|
|
@c @cindex quotient
|
|
@c @cindex exponentiation
|
|
|
|
The @command{awk} language uses the common arithmetic operators when
|
|
evaluating expressions. All of these arithmetic operators follow normal
|
|
precedence rules and work as you would expect them to.
|
|
|
|
The following example uses a file named @file{grades}, which contains
|
|
a list of student names as well as three test scores per student (it's
|
|
a small class):
|
|
|
|
@example
|
|
Pat 100 97 58
|
|
Sandy 84 72 93
|
|
Chris 72 92 89
|
|
@end example
|
|
|
|
@noindent
|
|
This program takes the file @file{grades} and prints the average
|
|
of the scores:
|
|
|
|
@example
|
|
$ awk '@{ sum = $2 + $3 + $4 ; avg = sum / 3
|
|
> print $1, avg @}' grades
|
|
@print{} Pat 85
|
|
@print{} Sandy 83
|
|
@print{} Chris 84.3333
|
|
@end example
|
|
|
|
The following list provides the arithmetic operators in @command{awk}, in order from
|
|
the highest precedence to the lowest:
|
|
|
|
@table @code
|
|
@cindex POSIX @command{awk}, arithmetic operators and
@item @var{x} ^ @var{y}
@itemx @var{x} ** @var{y}
Exponentiation; @var{x} raised to the @var{y} power. @samp{2 ^ 3} has
the value eight; the character sequence @samp{**} is equivalent to
@samp{^}.

@item - @var{x}
Negation.

@item + @var{x}
Unary plus; the expression is converted to a number.
|
|
|
|
@item @var{x} * @var{y}
|
|
Multiplication.
|
|
|
|
@cindex troubleshooting, division
|
|
@cindex division
|
|
@item @var{x} / @var{y}
|
|
Division; because all numbers in @command{awk} are floating-point
|
|
numbers, the result is @emph{not} rounded to an integer---@samp{3 / 4} has
|
|
the value 0.75. (It is a common mistake, especially for C programmers,
|
|
to forget that @emph{all} numbers in @command{awk} are floating-point,
|
|
and that division of integer-looking constants produces a real number,
|
|
not an integer.)
|
|
|
|
@item @var{x} % @var{y}
|
|
Remainder; further discussion is provided in the text, just
|
|
after this list.
|
|
|
|
@item @var{x} + @var{y}
|
|
Addition.
|
|
|
|
@item @var{x} - @var{y}
|
|
Subtraction.
|
|
@end table
|
|
|
|
Unary plus and minus have the same precedence,
|
|
the multiplication operators all have the same precedence, and
|
|
addition and subtraction have the same precedence.
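
A few one-line checks make these rules concrete. This is a sketch with
arbitrarily chosen numbers. It relies on the fact that exponentiation
binds more tightly than unary minus, so @samp{-2 ^ 2} is the negation of
four. The shell variable @code{out} is hypothetical.

```shell
# -2 ^ 2 is -(2 ^ 2); division is not truncated; * binds tighter than +.
out=$(awk 'BEGIN { print -2 ^ 2, 3 / 4, 2 + 3 * 4 }')
printf '%s\n' "$out"
```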
|
|
|
|
@cindex differences in @command{awk} and @command{gawk}, trunc-mod operation
|
|
@cindex trunc-mod operation
|
|
When computing the remainder of @code{@var{x} % @var{y}},
|
|
the quotient is rounded toward zero to an integer and
|
|
multiplied by @var{y}. This result is subtracted from @var{x};
|
|
this operation is sometimes known as ``trunc-mod.'' The following
|
|
relation always holds:
|
|
|
|
@example
|
|
b * int(a / b) + (a % b) == a
|
|
@end example
|
|
|
|
One possibly undesirable effect of this definition of remainder is that
|
|
@code{@var{x} % @var{y}} is negative if @var{x} is negative. Thus:
|
|
|
|
@example
|
|
-17 % 8 = -1
|
|
@end example
|
|
|
|
In other @command{awk} implementations, the signedness of the remainder
|
|
may be machine-dependent.
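The relation above can be checked directly. The following sketch (the
variable names @code{a} and @code{b} are only illustrative) prints the
remainder, the dividend reconstructed from the relation, and a nonnegative
remainder obtained by adding the divisor back and reducing again. With
@command{gawk}, it should print:

@example
$ awk 'BEGIN @{ a = -17; b = 8
>              print a % b, b * int(a / b) + (a % b), ((a % b) + b) % b @}'
@print{} -1 -17 7
@end example

@noindent
The last expression is one common way to get a result in the range
zero to @code{b - 1} when @code{b} is positive.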
|
|
@c !!! what does posix say?
|
|
|
|
@cindex portability, @code{**} operator and
|
|
@cindex @code{*} (asterisk), @code{**} operator
|
|
@cindex asterisk (@code{*}), @code{**} operator
|
|
@strong{Note:}
|
|
The POSIX standard only specifies the use of @samp{^}
|
|
for exponentiation.
|
|
For maximum portability, do not use the @samp{**} operator.
|
|
|
|
@node Concatenation
|
|
@section String Concatenation
|
|
@cindex Kernighan, Brian
|
|
@quotation
|
|
@i{It seemed like a good idea at the time.}@*
|
|
Brian Kernighan
|
|
@end quotation
|
|
|
|
@cindex string operators
|
|
@cindex operators, string
|
|
@cindex concatenating
|
|
There is only one string operation: concatenation. It does not have a
|
|
specific operator to represent it. Instead, concatenation is performed by
|
|
writing expressions next to one another, with no operator. For example:
|
|
|
|
@example
|
|
$ awk '@{ print "Field number one: " $1 @}' BBS-list
|
|
@print{} Field number one: aardvark
|
|
@print{} Field number one: alpo-net
|
|
@dots{}
|
|
@end example
|
|
|
|
Without the space in the string constant after the @samp{:}, the line
|
|
runs together. For example:
|
|
|
|
@example
|
|
$ awk '@{ print "Field number one:" $1 @}' BBS-list
|
|
@print{} Field number one:aardvark
|
|
@print{} Field number one:alpo-net
|
|
@dots{}
|
|
@end example
|
|
|
|
@cindex troubleshooting, string concatenation
|
|
Because string concatenation does not have an explicit operator, it is
|
|
often necessary to ensure that it happens at the right time by using
|
|
parentheses to enclose the items to concatenate. For example, the
|
|
following code fragment does not concatenate @code{file} and @code{name}
|
|
as you might expect:
|
|
|
|
@example
|
|
file = "file"
|
|
name = "name"
|
|
print "something meaningful" > file name
|
|
@end example
|
|
|
|
@noindent
|
|
It is necessary to use the following:
|
|
|
|
@example
|
|
print "something meaningful" > (file name)
|
|
@end example
|
|
|
|
@cindex order of evaluation, concatenation
|
|
@cindex evaluation order, concatenation
|
|
@cindex side effects
|
|
Parentheses should be used around concatenation in all but the
|
|
most common contexts, such as on the righthand side of @samp{=}.
|
|
Be careful about the kinds of expressions used in string concatenation.
|
|
In particular, the order of evaluation of expressions used for concatenation
|
|
is undefined in the @command{awk} language. Consider this example:
|
|
|
|
@example
|
|
BEGIN @{
|
|
a = "don't"
|
|
print (a " " (a = "panic"))
|
|
@}
|
|
@end example
|
|
|
|
@noindent
|
|
It is not defined whether the assignment to @code{a} happens
|
|
before or after the value of @code{a} is retrieved for producing the
|
|
concatenated value. The result could be either @samp{don't panic},
|
|
or @samp{panic panic}.
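One way to sidestep the question is to keep the side effect out of the
concatenation entirely. The following is a minimal sketch; the temporary
variable @code{old} is introduced here only for illustration:

@example
BEGIN @{
    a = "don't"
    old = a               # save the value before changing a
    a = "panic"
    print (old " " a)     # always prints: don't panic
@}
@end example

@noindent
Because each assignment is a separate statement, the result no longer
depends on the order in which the operands of the concatenation are evaluated.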
|
|
@c see test/nasty.awk for a worse example
|
|
The precedence of concatenation, when mixed with other operators, is often
|
|
counter-intuitive. Consider this example:
|
|
|
|
@ignore
|
|
> To: bug-gnu-utils@@gnu.org
|
|
> CC: arnold@gnu.org
|
|
> Subject: gawk 3.0.4 bug with {print -12 " " -24}
|
|
> From: Russell Schulz <Russell_Schulz@locutus.ofB.ORG>
|
|
> Date: Tue, 8 Feb 2000 19:56:08 -0700
|
|
>
|
|
> gawk 3.0.4 on NT gives me:
|
|
>
|
|
> prompt> cat bad.awk
|
|
> BEGIN { print -12 " " -24; }
|
|
>
|
|
> prompt> gawk -f bad.awk
|
|
> -12-24
|
|
>
|
|
> when I would expect
|
|
>
|
|
> -12 -24
|
|
>
|
|
> I have not investigated the source, or other implementations. The
|
|
> bug is there on my NT and DOS versions 2.15.6 .
|
|
@end ignore
|
|
|
|
@example
|
|
$ awk 'BEGIN @{ print -12 " " -24 @}'
|
|
@print{} -12-24
|
|
@end example
|
|
|
|
This ``obviously'' is concatenating @minus{}12, a space, and @minus{}24.
|
|
But where did the space disappear to?
|
|
The answer lies in the combination of operator precedences and
|
|
@command{awk}'s automatic conversion rules. To get the desired result,
|
|
write the program in the following manner:
|
|
|
|
@example
|
|
$ awk 'BEGIN @{ print -12 " " (-24) @}'
|
|
@print{} -12 -24
|
|
@end example
|
|
|
|
This forces @command{awk} to treat the @samp{-} on the @samp{-24} as unary.
|
|
Otherwise, it's parsed as follows:
|
|
|
|
@display
|
|
@minus{}12 (@code{"@ "} @minus{} 24)
|
|
@result{} @minus{}12 (0 @minus{} 24)
|
|
@result{} @minus{}12 (@minus{}24)
|
|
@result{} @minus{}12@minus{}24
|
|
@end display
|
|
|
|
As mentioned earlier,
|
|
when doing concatenation, @emph{parenthesize}. Otherwise,
|
|
you're never quite sure what you'll get.
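The same interaction shows up with addition, which also binds more tightly
than concatenation. The following sketch (the variable names @code{a} and
@code{b} are arbitrary) contrasts the unparenthesized and parenthesized forms:

@example
BEGIN @{
    a = "sum: " 1 + 2       # parsed as "sum: " (1 + 2)
    b = ("sum: " 1) + 2     # the string "sum: 1" has the numeric value zero
    print a                 # prints: sum: 3
    print b                 # prints: 2
@}
@end example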
|
|
|
|
@node Assignment Ops
|
|
@section Assignment Expressions
|
|
@c STARTOFRANGE asop
|
|
@cindex assignment operators
|
|
@c STARTOFRANGE opas
|
|
@cindex operators, assignment
|
|
@c STARTOFRANGE exas
|
|
@cindex expressions, assignment
|
|
@cindex @code{=} (equals sign), @code{=} operator
|
|
@cindex equals sign (@code{=}), @code{=} operator
|
|
An @dfn{assignment} is an expression that stores a (usually different)
|
|
value into a variable. For example, let's assign the value one to the variable
|
|
@code{z}:
|
|
|
|
@example
|
|
z = 1
|
|
@end example
|
|
|
|
After this expression is executed, the variable @code{z} has the value one.
|
|
Whatever old value @code{z} had before the assignment is forgotten.
|
|
|
|
Assignments can also store string values. For example, the
|
|
following stores
|
|
the value @code{"this food is good"} in the variable @code{message}:
|
|
|
|
@example
|
|
thing = "food"
|
|
predicate = "good"
|
|
message = "this " thing " is " predicate
|
|
@end example
|
|
|
|
@noindent
|
|
@cindex side effects, assignment expressions
|
|
This also illustrates string concatenation.
|
|
The @samp{=} sign is called an @dfn{assignment operator}. It is the
|
|
simplest assignment operator because the value of the righthand
|
|
operand is stored unchanged.
|
|
Most operators (addition, concatenation, and so on) have no effect
|
|
except to compute a value. If the value isn't used, there's no reason to
|
|
use the operator. An assignment operator is different; it does
|
|
produce a value, but even if you ignore it, the assignment still
|
|
makes itself felt through the alteration of the variable. We call this
|
|
a @dfn{side effect}.
|
|
|
|
@cindex lvalues/rvalues
|
|
@cindex rvalues/lvalues
|
|
@cindex assignment operators, lvalues/rvalues
|
|
@cindex operators, assignment
|
|
The lefthand operand of an assignment need not be a variable
|
|
(@pxref{Variables}); it can also be a field
|
|
(@pxref{Changing Fields}) or
|
|
an array element (@pxref{Arrays}).
|
|
These are all called @dfn{lvalues},
|
|
which means they can appear on the lefthand side of an assignment operator.
|
|
The righthand operand may be any expression; it produces the new value
|
|
that the assignment stores in the specified variable, field, or array
|
|
element. (Such values are called @dfn{rvalues}.)
|
|
|
|
@cindex variables, types of
|
|
It is important to note that variables do @emph{not} have permanent types.
|
|
A variable's type is simply the type of whatever value it happens
|
|
to hold at the moment. In the following program fragment, the variable
|
|
@code{foo} has a numeric value at first, and a string value later on:
|
|
|
|
@example
|
|
foo = 1
|
|
print foo
|
|
foo = "bar"
|
|
print foo
|
|
@end example
|
|
|
|
@noindent
|
|
When the second assignment gives @code{foo} a string value, the fact that
|
|
it previously had a numeric value is forgotten.
|
|
|
|
A string that does not begin with a digit (optionally preceded by whitespace, a sign, or a decimal point) has a numeric value of
|
|
zero. After executing the following code, the value of @code{foo} is five:
|
|
|
|
@example
|
|
foo = "a string"
|
|
foo = foo + 5
|
|
@end example
|
|
|
|
@noindent
|
|
@strong{Note:} Using a variable as a number and then later as a string
|
|
can be confusing and is poor programming style. The previous two examples
|
|
illustrate how @command{awk} works, @emph{not} how you should write your
|
|
programs!
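The string-to-number conversion can be observed without reusing a variable
for two purposes. The following sketch adds five to several string constants
(chosen here only for illustration); the comments show the values that
@command{gawk} is expected to print:

@example
BEGIN @{
    print "a string" + 5    # 5: no leading number, so the value is zero
    print "12abc" + 5       # 17: the leading digits are used
    print " +2" + 5         # 7: leading whitespace and a sign are accepted
    print "" + 5            # 5: the null string is zero
@}
@end example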
|
|
|
|
An assignment is an expression, so it has a value---the same value that
|
|
is assigned. Thus, @samp{z = 1} is an expression with the value one.
|
|
One consequence of this is that you can write multiple assignments together,
|
|
such as:
|
|
|
|
@example
|
|
x = y = z = 5
|
|
@end example
|
|
|
|
@noindent
|
|
This example stores the value five in all three variables
|
|
(@code{x}, @code{y}, and @code{z}).
|
|
It does so because the
|
|
value of @samp{z = 5}, which is five, is stored into @code{y} and then
|
|
the value of @samp{y = z = 5}, which is five, is stored into @code{x}.
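The following sketch shows the value of an assignment being used in a
larger expression. The variables @code{v} and @code{w} are introduced
only for this illustration:

@example
BEGIN @{
    x = y = z = 5
    v = (w = 7) + 1    # the inner assignment has the value 7
    print x, y, z      # prints: 5 5 5
    print v, w         # prints: 8 7
@}
@end example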
|
|
|
|
Assignments may be used anywhere an expression is called for. For
|
|
example, it is valid to write @samp{x != (y = 1)} to set @code{y} to one,
|
|
and then test whether @code{x} equals one. But this style tends to make
|
|
programs hard to read; such nesting of assignments should be avoided,
|
|
except perhaps in a one-shot program.
|
|
|
|
@cindex @code{+} (plus sign), @code{+=} operator
|
|
@cindex plus sign (@code{+}), @code{+=} operator
|
|
Aside from @samp{=}, there are several other assignment operators that
|
|
do arithmetic with the old value of the variable. For example, the
|
|
operator @samp{+=} computes a new value by adding the righthand value
|
|
to the old value of the variable. Thus, the following assignment adds
|
|
five to the value of @code{foo}:
|
|
|
|
@example
|
|
foo += 5
|
|
@end example
|
|
|
|
@noindent
|
|
This is equivalent to the following:
|
|
|
|
@example
|
|
foo = foo + 5
|
|
@end example
|
|
|
|
@noindent
|
|
Use whichever makes the meaning of your program clearer.
|
|
|
|
There are situations where using @samp{+=} (or any assignment operator)
|
|
is @emph{not} the same as simply repeating the lefthand operand in the
|
|
righthand expression. For example:
|
|
|
|
@cindex Rankin, Pat
|
|
@example
|
|
# Thanks to Pat Rankin for this example
|
|
BEGIN @{
|
|
foo[rand()] += 5
|
|
for (x in foo)
|
|
print x, foo[x]
|
|
|
|
bar[rand()] = bar[rand()] + 5
|
|
for (x in bar)
|
|
print x, bar[x]
|
|
@}
|
|
@end example
|
|
|
|
@cindex operators, assignment, evaluation order
|
|
@cindex assignment operators, evaluation order
|
|
@noindent
|
|
The indices of @code{bar} are practically guaranteed to be different, because
|
|
@code{rand} returns different values each time it is called.
|
|
(Arrays and the @code{rand} function haven't been covered yet.
|
|
@xref{Arrays},
|
|
and see @ref{Numeric Functions}, for more information.)
|
|
This example illustrates an important fact about assignment
|
|
operators: the lefthand expression is only evaluated @emph{once}.
|
|
It is up to the implementation as to which expression is evaluated
|
|
first, the lefthand or the righthand.
|
|
Consider this example:
|
|
|
|
@example
|
|
i = 1
|
|
a[i += 2] = i + 1
|
|
@end example
|
|
|
|
@noindent
|
|
The value of @code{a[3]} could be either two or four.
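If a particular result is wanted, splitting the statement removes the
ambiguity. The following sketch picks one of the two possible readings
(increment first, then evaluate the righthand side); it is an illustration,
not a statement of what any given implementation does with the original code:

@example
BEGIN @{
    i = 1
    i += 2           # i is now 3
    a[i] = i + 1     # a[3] is 4, regardless of the implementation
    print i, a[3]    # prints: 3 4
@}
@end example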
|
|
|
|
Here is a table of the arithmetic assignment operators. In each
|
|
case, the righthand operand is an expression whose value is converted
|
|
to a number.
|
|
|
|
@ignore
|
|
@table @code
|
|
@item @var{lvalue} += @var{increment}
|
|
Adds @var{increment} to the value of @var{lvalue}.
|
|
|
|
@item @var{lvalue} -= @var{decrement}
|
|
Subtracts @var{decrement} from the value of @var{lvalue}.
|
|
|
|
@item @var{lvalue} *= @var{coefficient}
|
|
Multiplies the value of @var{lvalue} by @var{coefficient}.
|
|
|
|
@item @var{lvalue} /= @var{divisor}
|
|
Divides the value of @var{lvalue} by @var{divisor}.
|
|
|
|
@item @var{lvalue} %= @var{modulus}
|
|
Sets @var{lvalue} to its remainder by @var{modulus}.
|
|
|
|
@cindex @command{awk} language, POSIX version
|
|
@cindex POSIX @command{awk}
|
|
@item @var{lvalue} ^= @var{power}
|
|
@itemx @var{lvalue} **= @var{power}
|
|
Raises @var{lvalue} to the power @var{power}.
|
|
(Only the @samp{^=} operator is specified by POSIX.)
|
|
@end table
|
|
@end ignore
|
|
|
|
@cindex @code{-} (hyphen), @code{-=} operator
|
|
@cindex hyphen (@code{-}), @code{-=} operator
|
|
@cindex @code{*} (asterisk), @code{*=} operator
|
|
@cindex asterisk (@code{*}), @code{*=} operator
|
|
@cindex @code{/} (forward slash), @code{/=} operator
|
|
@cindex forward slash (@code{/}), @code{/=} operator
|
|
@cindex @code{%} (percent sign), @code{%=} operator
|
|
@cindex percent sign (@code{%}), @code{%=} operator
|
|
@cindex @code{^} (caret), @code{^=} operator
|
|
@cindex caret (@code{^}), @code{^=} operator
|
|
@cindex @code{*} (asterisk), @code{**=} operator
|
|
@cindex asterisk (@code{*}), @code{**=} operator
|
|
@multitable {@var{lvalue} *= @var{coefficient}} {Subtracts @var{decrement} from the value of @var{lvalue}.}
|
|
@item @var{lvalue} @code{+=} @var{increment} @tab Adds @var{increment} to the value of @var{lvalue}.
|
|
|
|
@item @var{lvalue} @code{-=} @var{decrement} @tab Subtracts @var{decrement} from the value of @var{lvalue}.
|
|
|
|
@item @var{lvalue} @code{*=} @var{coefficient} @tab Multiplies the value of @var{lvalue} by @var{coefficient}.
|
|
|
|
@item @var{lvalue} @code{/=} @var{divisor} @tab Divides the value of @var{lvalue} by @var{divisor}.
|
|
|
|
@item @var{lvalue} @code{%=} @var{modulus} @tab Sets @var{lvalue} to its remainder by @var{modulus}.
|
|
|
|
@cindex @command{awk} language, POSIX version
|
|
@cindex POSIX @command{awk}
|
|
@item @var{lvalue} @code{^=} @var{power} @tab
|
|
@item @var{lvalue} @code{**=} @var{power} @tab Raises @var{lvalue} to the power @var{power}.
|
|
@end multitable
|
|
|
|
@cindex POSIX @command{awk}, @code{**=} operator and
|
|
@cindex portability, @code{**=} operator and
|
|
@strong{Note:}
|
|
Only the @samp{^=} operator is specified by POSIX.
|
|
For maximum portability, do not use the @samp{**=} operator.
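The following sketch applies each of the portable operators in turn to a
single variable (the name @code{n} and the constants are arbitrary). The
comments show the value printed after each step:

@example
BEGIN @{
    n = 10
    n += 5; print n    # 15
    n -= 3; print n    # 12
    n *= 2; print n    # 24
    n /= 8; print n    # 3
    n ^= 2; print n    # 9
    n %= 4; print n    # 1
@}
@end example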
|
|
|
|
@c fakenode --- for prepinfo
|
|
@subheading Advanced Notes: Syntactic Ambiguities Between @samp{/=} and Regular Expressions
|
|
@cindex advanced features, regexp constants
|
|
@cindex dark corner, regexp constants, @code{/=} operator and
|
|
@cindex @code{/} (forward slash), @code{/=} operator, vs. @code{/=@dots{}/} regexp constant
|
|
@cindex forward slash (@code{/}), @code{/=} operator, vs. @code{/=@dots{}/} regexp constant
|
|
@cindex regexp constants, @code{/=@dots{}/}, @code{/=} operator and
|
|
|
|
@c derived from email from "Nelson H. F. Beebe" <beebe@math.utah.edu>
|
|
@c Date: Mon, 1 Sep 1997 13:38:35 -0600 (MDT)
|
|
|
|
@cindex dark corner
|
|
@cindex ambiguity, syntactic: @code{/=} operator vs. @code{/=@dots{}/} regexp constant
|
|
@cindex syntactic ambiguity: @code{/=} operator vs. @code{/=@dots{}/} regexp constant
|
|
@cindex @code{/=} operator vs. @code{/=@dots{}/} regexp constant
|
|
There is a syntactic ambiguity between the @samp{/=} assignment
|
|
operator and regexp constants whose first character is an @samp{=}.
|
|
@value{DARKCORNER}
|
|
This is most notable in commercial @command{awk} versions.
|
|
For example:
|
|
|
|
@example
|
|
$ awk /==/ /dev/null
|
|
@error{} awk: syntax error at source line 1
|
|
@error{} context is
|
|
@error{} >>> /= <<<
|
|
@error{} awk: bailing out at source line 1
|
|
@end example
|
|
|
|
@noindent
|
|
A workaround is:
|
|
|
|
@example
|
|
awk '/[=]=/' /dev/null
|
|
@end example
|
|
|
|
@command{gawk} does not have this problem,
|
|
nor do the other
|
|
freely available versions described in
|
|
@ref{Other Versions}.
|
|
@c ENDOFRANGE exas
|
|
@c ENDOFRANGE opas
|
|
@c ENDOFRANGE asop
|
|
|
|
@node Increment Ops
|
|
@section Increment and Decrement Operators
|
|
|
|
@c STARTOFRANGE inop
|
|
@cindex increment operators
|
|
@c STARTOFRANGE opde
|
|
@cindex operators, decrement/increment
|
|
@dfn{Increment} and @dfn{decrement operators} increase or decrease the value of
|
|
a variable by one. An assignment operator can do the same thing, so
|
|
the increment operators add no power to the @command{awk} language; however, they
|
|
are convenient abbreviations for very common operations.
|
|
|
|
@cindex side effects
|
|
@cindex @code{+} (plus sign), decrement/increment operators
|
|
@cindex plus sign (@code{+}), decrement/increment operators
|
|
@cindex side effects, decrement/increment operators
|
|
The operator used for adding one is written @samp{++}. It can be used to increment
|
|
a variable either before or after taking its value.
|
|
To pre-increment a variable @code{v}, write @samp{++v}. This adds
|
|
one to the value of @code{v}---that new value is also the value of the
|
|
expression. (The assignment expression @samp{v += 1} is completely
|
|
equivalent.)
|
|
Writing the @samp{++} after the variable specifies post-increment. This
|
|
increments the variable value just the same; the difference is that the
|
|
value of the increment expression itself is the variable's @emph{old}
|
|
value. Thus, if @code{foo} has the value four, then the expression @samp{foo++}
|
|
has the value four, but it changes the value of @code{foo} to five.
|
|
In other words, the operator returns the old value of the variable,
|
|
but with the side effect of incrementing it.
|
|
|
|
The post-increment @samp{foo++} is nearly the same as writing @samp{(foo
|
|
+= 1) - 1}. It is not perfectly equivalent because all numbers in
|
|
@command{awk} are floating-point---in floating-point, @samp{foo + 1 - 1} does
|
|
not necessarily equal @code{foo}. But the difference is minute as
|
|
long as you stick to numbers that are fairly small (less than 10e12).
|
|
|
|
@cindex @code{$} (dollar sign), incrementing fields and arrays
|
|
@cindex dollar sign (@code{$}), incrementing fields and arrays
|
|
Fields and array elements are incremented
|
|
just like variables. (Use @samp{$(i++)} when you want to do a field reference
|
|
and a variable increment at the same time. The parentheses are necessary
|
|
because of the precedence of the field reference operator @samp{$}.)
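For instance, the following sketch walks through the fields of a record,
combining the field reference and the increment in one expression (the
loop variable @code{i} is arbitrary):

@example
$ echo a b c | awk '@{ i = 1; while (i <= NF) print $(i++) @}'
@print{} a
@print{} b
@print{} c
@end example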
|
|
|
|
@cindex decrement operators
|
|
The decrement operator @samp{--} works just like @samp{++}, except that
|
|
it subtracts one instead of adding it. As with @samp{++}, it can be used before
|
|
the lvalue to pre-decrement or after it to post-decrement.
|
|
Following is a summary of increment and decrement expressions:
|
|
|
|
@table @code
|
|
@cindex @code{+} (plus sign), @code{++} operator
|
|
@cindex plus sign (@code{+}), @code{++} operator
|
|
@item ++@var{lvalue}
|
|
This expression increments @var{lvalue}, and the new value becomes the
|
|
value of the expression.
|
|
|
|
@item @var{lvalue}++
|
|
This expression increments @var{lvalue}, but
|
|
the value of the expression is the @emph{old} value of @var{lvalue}.
|
|
|
|
@cindex @code{-} (hyphen), @code{--} operator
|
|
@cindex hyphen (@code{-}), @code{--} operator
|
|
@item --@var{lvalue}
|
|
This expression is
|
|
like @samp{++@var{lvalue}}, but instead of adding, it subtracts. It
|
|
decrements @var{lvalue} and delivers the value that is the result.
|
|
|
|
@item @var{lvalue}--
|
|
This expression is
|
|
like @samp{@var{lvalue}++}, but instead of adding, it subtracts. It
|
|
decrements @var{lvalue}. The value of the expression is the @emph{old}
|
|
value of @var{lvalue}.
|
|
@end table
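The four forms can be compared side by side. In the following sketch,
each comment shows the value printed, which is the value of the expression
and not necessarily the value left in @code{foo}:

@example
BEGIN @{
    foo = 4
    print foo++    # 4 (foo is now 5)
    print foo      # 5
    print ++foo    # 6
    print foo--    # 6 (foo is now 5)
    print --foo    # 4
@}
@end example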
|
|
|
|
@c fakenode --- for prepinfo
|
|
@subheading Advanced Notes: Operator Evaluation Order
|
|
@c comma before precedence does NOT start tertiary
|
|
@cindex advanced features, operators, precedence
|
|
@cindex precedence
|
|
@cindex operators, precedence
|
|
@cindex portability, operators
|
|
@cindex evaluation order
|
|
@cindex Marx, Groucho
|
|
@quotation
|
|
@i{Doctor, doctor! It hurts when I do this!@*
|
|
So don't do that!}@*
|
|
Groucho Marx
|
|
@end quotation
|
|
|
|
@noindent
|
|
What happens for something like the following?
|
|
|
|
@example
|
|
b = 6
|
|
print b += b++
|
|
@end example
|
|
|
|
@noindent
|
|
Or something even stranger?
|
|
|
|
@example
|
|
b = 6
|
|
b += ++b + b++
|
|
print b
|
|
@end example
|
|
|
|
@cindex side effects
|
|
In other words, when do the various side effects prescribed by the
|
|
postfix operators (@samp{b++}) take effect?
|
|
When side effects happen is @dfn{implementation defined}.
|
|
In other words, it is up to the particular version of @command{awk}.
|
|
The result for the first example may be 12 or 13, and for the second, it
|
|
may be 22 or 23.
|
|
|
|
In short, doing things like this is not recommended and definitely
|
|
not anything that you can rely upon for portability.
|
|
You should avoid such things in your own programs.
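If the intent behind something like @samp{b += b++} is known, it can be
written so that every side effect happens in its own statement. The
following sketch spells out just one possible reading (add the old value
after incrementing); the variable @code{old} is introduced only for
illustration, and another reading would be written differently:

@example
BEGIN @{
    b = 6
    old = b      # remember the value before the increment
    b++          # b is now 7
    b += old     # b is now 13
    print b
@}
@end example

@noindent
Written this way, the program prints 13 with any @command{awk}.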
|
|
@c You'll sleep better at night and be able to look at yourself
|
|
@c in the mirror in the morning.
|
|
@c ENDOFRANGE inop
|
|
@c ENDOFRANGE opde
|
|
@c ENDOFRANGE deop
|
|
|
|
@node Truth Values
|
|
@section True and False in @command{awk}
|
|
@cindex truth values
|
|
@cindex logical false/true
|
|
@cindex false, logical
|
|
@cindex true, logical
|
|
|
|
@cindex null strings
|
|
Many programming languages have a special representation for the concepts
|
|
of ``true'' and ``false.'' Such languages usually use the special
|
|
constants @code{true} and @code{false}, or perhaps their uppercase
|
|
equivalents.
|
|
However, @command{awk} is different.
|
|
It borrows a very simple concept of true and
|
|
false from C. In @command{awk}, any nonzero numeric value @emph{or} any
|
|
nonempty string value is true. Any other value (zero or the null
|
|
string @code{""}) is false. The following program prints @samp{A strange
|
|
truth value} three times:
|
|
|
|
@example
|
|
BEGIN @{
|
|
if (3.1415927)
|
|
print "A strange truth value"
|
|
if ("Four Score And Seven Years Ago")
|
|
print "A strange truth value"
|
|
if (j = 57)
|
|
print "A strange truth value"
|
|
@}
|
|
@end example
|
|
|
|
@cindex dark corner
|
|
There is a surprising consequence of the ``nonzero or non-null'' rule:
|
|
the string constant @code{"0"} is actually true, because it is non-null.
|
|
@value{DARKCORNER}
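The rule can be seen with a small helper. In the following sketch, the
function @code{truth} is a hypothetical name chosen for this illustration;
it simply reports how its argument behaves as a condition:

@example
function truth(v) @{ return v ? "true" : "false" @}

BEGIN @{
    print truth(0)      # false: the number zero
    print truth("")     # false: the null string
    print truth("0")    # true: a nonempty string constant
@}
@end example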
|
|
|
|
@node Typing and Comparison
|
|
@section Variable Typing and Comparison Expressions
|
|
@quotation
|
|
@i{The Guide is definitive. Reality is frequently inaccurate.}@*
|
|
The Hitchhiker's Guide to the Galaxy
|
|
@end quotation
|
|
|
|
@c STARTOFRANGE comex
|
|
@cindex comparison expressions
|
|
@c STARTOFRANGE excom
|
|
@cindex expressions, comparison
|
|
@cindex expressions, matching, See comparison expressions
|
|
@cindex matching, expressions, See comparison expressions
|
|
@cindex relational operators, See comparison operators
|
|
@c comma is part of See
|
|
@cindex operators, relational, See operators, comparison
|
|
@c STARTOFRANGE varting
|
|
@cindex variable typing
|
|
@c STARTOFRANGE vartypc
|
|
@cindex variables, types of, comparison expressions and
|
|
Unlike variables in many other programming languages, @command{awk} variables do not have a
|
|
fixed type. Instead, they can be either a number or a string, depending
|
|
upon the value that is assigned to them.
|
|
|
|
@cindex numeric, strings
|
|
@cindex strings, numeric
|
|
@cindex POSIX @command{awk}, numeric strings and
|
|
The 1992 POSIX standard introduced
|
|
the concept of a @dfn{numeric string}, which is simply a string that looks
|
|
like a number---for example, @code{@w{" +2"}}. This concept is used
|
|
for determining the type of a variable.
|
|
The type of the variable is important because the types of two variables
|
|
determine how they are compared.
|
|
In @command{gawk}, variable typing follows these rules:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
A numeric constant or the result of a numeric operation has the @var{numeric}
|
|
attribute.
|
|
|
|
@item
|
|
A string constant or the result of a string operation has the @var{string}
|
|
attribute.
|
|
|
|
@item
|
|
Fields, @code{getline} input, @code{FILENAME}, @code{ARGV} elements,
|
|
@code{ENVIRON} elements, and the
|
|
elements of an array created by @code{split} that are numeric strings
|
|
have the @var{strnum} attribute. Otherwise, they have the @var{string}
|
|
attribute.
|
|
Uninitialized variables also have the @var{strnum} attribute.
|
|
|
|
@item
|
|
Attributes propagate across assignments but are not changed by
|
|
any use.
|
|
@c (Although a use may cause the entity to acquire an additional
|
|
@c value such that it has both a numeric and string value, this leaves the
|
|
@c attribute unchanged.)
|
|
@c This is important but not relevant
|
|
@end itemize
|
|
|
|
The last rule is particularly important. In the following program,
|
|
@code{a} has numeric type, even though it is later used in a string
|
|
operation:
|
|
|
|
@example
|
|
BEGIN @{
|
|
a = 12.345
|
|
b = a " is a cute number"
|
|
print b
|
|
@}
|
|
@end example
|
|
|
|
When two operands are compared, either string comparison or numeric comparison
|
|
may be used. This depends upon the attributes of the operands, according to the
|
|
following symmetric matrix:
|
|
|
|
@c thanks to Karl Berry, kb@cs.umb.edu, for major help with TeX tables
|
|
@tex
|
|
\centerline{
|
|
\vbox{\bigskip % space above the table (about 1 linespace)
|
|
% Because we have vertical rules, we can't let TeX insert interline space
|
|
% in its usual way.
|
|
\offinterlineskip
|
|
%
|
|
% Define the table template. & separates columns, and \cr ends the
|
|
% template (and each row). # is replaced by the text of that entry on
|
|
% each row. The template for the first column breaks down like this:
|
|
% \strut -- a way to make each line have the height and depth
|
|
% of a normal line of type, since we turned off interline spacing.
|
|
% \hfil -- infinite glue; has the effect of right-justifying in this case.
|
|
% # -- replaced by the text (for instance, `STRNUM', in the last row).
|
|
% \quad -- about the width of an `M'. Just separates the columns.
|
|
%
|
|
% The second column (\vrule#) is what generates the vertical rule that
|
|
% spans table rows.
|
|
%
|
|
% The doubled && before the next entry means `repeat the following
|
|
% template as many times as necessary on each line' -- in our case, twice.
|
|
%
|
|
% The template itself, \quad#\hfil, left-justifies with a little space before.
|
|
%
|
|
\halign{\strut\hfil#\quad&\vrule#&&\quad#\hfil\cr
|
|
&&STRING &NUMERIC &STRNUM\cr
|
|
% The \omit tells TeX to skip inserting the template for this column on
|
|
% this particular row. In this case, we only want a little extra space
|
|
% to separate the heading row from the rule below it. the depth 2pt --
|
|
% `\vrule depth 2pt' is that little space.
|
|
\omit &depth 2pt\cr
|
|
% This is the horizontal rule below the heading. Since it has nothing to
|
|
% do with the columns of the table, we use \noalign to get it in there.
|
|
\noalign{\hrule}
|
|
% Like above, this time a little more space.
|
|
\omit &depth 4pt\cr
|
|
% The remaining rows have nothing special about them.
|
|
STRING &&string &string &string\cr
|
|
NUMERIC &&string &numeric &numeric\cr
|
|
STRNUM &&string &numeric &numeric\cr
|
|
}}}
|
|
@end tex
|
|
@ifnottex
|
|
@display
|
|
+----------------------------------------------
|
|
| STRING NUMERIC STRNUM
|
|
--------+----------------------------------------------
|
|
|
|
|
STRING | string string string
|
|
|
|
|
NUMERIC | string numeric numeric
|
|
|
|
|
STRNUM | string numeric numeric
|
|
--------+----------------------------------------------
|
|
@end display
|
|
@end ifnottex
|
|
|
|
The basic idea is that user input that looks numeric---and @emph{only}
|
|
user input---should be treated as numeric, even though it is actually
|
|
made of characters and is therefore also a string.
|
|
Thus, for example, the string constant @w{@code{" +3.14"}}
|
|
is a string, even though it looks numeric,
|
|
and is @emph{never} treated as a number for comparison
|
|
purposes.
|
|
|
|
In short, when one operand is a ``pure'' string, such as a string
|
|
constant, then a string comparison is performed. Otherwise, a
|
|
numeric comparison is performed.@footnote{The POSIX standard is under
|
|
revision. The revised standard's rules for typing and comparison are
|
|
the same as just described for @command{gawk}.}
|
|
|
|
@dfn{Comparison expressions} compare strings or numbers for
|
|
relationships such as equality. They are written using @dfn{relational
|
|
operators}, which are a superset of those in C. Here is a table of
|
|
them:
|
|
|
|
@cindex @code{<} (left angle bracket), @code{<} operator
|
|
@cindex left angle bracket (@code{<}), @code{<} operator
|
|
@cindex @code{<} (left angle bracket), @code{<=} operator
|
|
@cindex left angle bracket (@code{<}), @code{<=} operator
|
|
@cindex @code{>} (right angle bracket), @code{>=} operator
|
|
@cindex right angle bracket (@code{>}), @code{>=} operator
|
|
@cindex @code{>} (right angle bracket), @code{>} operator
|
|
@cindex right angle bracket (@code{>}), @code{>} operator
|
|
@cindex @code{=} (equals sign), @code{==} operator
|
|
@cindex equals sign (@code{=}), @code{==} operator
|
|
@cindex @code{!} (exclamation point), @code{!=} operator
|
|
@cindex exclamation point (@code{!}), @code{!=} operator
|
|
@cindex @code{~} (tilde), @code{~} operator
|
|
@cindex tilde (@code{~}), @code{~} operator
|
|
@cindex @code{!} (exclamation point), @code{!~} operator
|
|
@cindex exclamation point (@code{!}), @code{!~} operator
|
|
@cindex @code{in} operator
|
|
@table @code
|
|
@item @var{x} < @var{y}
|
|
True if @var{x} is less than @var{y}.
|
|
|
|
@item @var{x} <= @var{y}
|
|
True if @var{x} is less than or equal to @var{y}.
|
|
|
|
@item @var{x} > @var{y}
|
|
True if @var{x} is greater than @var{y}.
|
|
|
|
@item @var{x} >= @var{y}
|
|
True if @var{x} is greater than or equal to @var{y}.
|
|
|
|
@item @var{x} == @var{y}
|
|
True if @var{x} is equal to @var{y}.
|
|
|
|
@item @var{x} != @var{y}
|
|
True if @var{x} is not equal to @var{y}.
|
|
|
|
@item @var{x} ~ @var{y}
|
|
True if the string @var{x} matches the regexp denoted by @var{y}.
|
|
|
|
@item @var{x} !~ @var{y}
|
|
True if the string @var{x} does not match the regexp denoted by @var{y}.
|
|
|
|
@item @var{subscript} in @var{array}
|
|
True if the array @var{array} has an element with the subscript @var{subscript}.
|
|
@end table
|
|
|
|
Comparison expressions have the value one if true and zero if false.
|
|
When comparing operands of mixed types, numeric operands are converted
|
|
to strings using the value of @code{CONVFMT}
|
|
(@pxref{Conversion}).
|
|
|
|
Strings are compared
|
|
by comparing the first character of each, then the second character of each,
|
|
and so on. Thus, @code{"10"} is less than @code{"9"}. If there are two
|
|
strings where one is a prefix of the other, the shorter string is less than
|
|
the longer one. Thus, @code{"abc"} is less than @code{"abcd"}.
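The difference between the two kinds of comparison is easy to see with
fields. In the following sketch, the first comparison uses the fields as
they are, and the second forces a string comparison by concatenating each
field with the null string (the variables @code{n} and @code{s} are
arbitrary names):

@example
$ echo 10 9 | awk '@{ n = ($1 < $2); s = (($1 "") < ($2 ""))
>                    print n, s @}'
@print{} 0 1
@end example

@noindent
Numerically, 10 is not less than 9, but as strings @code{"10"} sorts
before @code{"9"}.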
|
|
|
|
@cindex troubleshooting, @code{==} operator
|
|
It is very easy to accidentally mistype the @samp{==} operator and
|
|
leave off one of the @samp{=} characters. The result is still valid @command{awk}
|
|
code, but the program does not do what is intended:
|
|
|
|
@example
|
|
if (a = b) # oops! should be a == b
|
|
@dots{}
|
|
else
|
|
@dots{}
|
|
@end example
|
|
|
|
@noindent
|
|
Unless @code{b} happens to be zero or the null string, the @code{if}
|
|
part of the test always succeeds. Because the operators are
|
|
so similar, this kind of error is very difficult to spot when
|
|
scanning the source code.
|
|
|
|
@cindex @command{gawk}, comparison operators and
|
|
The following table of expressions illustrates the kind of comparison
|
|
@command{gawk} performs, as well as what the result of the comparison is:
|
|
|
|
@table @code
|
|
@item 1.5 <= 2.0
|
|
numeric comparison (true)
|
|
|
|
@item "abc" >= "xyz"
|
|
string comparison (false)
|
|
|
|
@item 1.5 != " +2"
|
|
string comparison (true)
|
|
|
|
@item "1e2" < "3"
|
|
string comparison (true)
|
|
|
|
@item a = 2; b = "2"
|
|
@itemx a == b
|
|
string comparison (true)
|
|
|
|
@item a = 2; b = " +2"
|
|
@itemx a == b
|
|
string comparison (false)
|
|
@end table
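The entries in this table can be run as a program. The following sketch
prints one for true and zero for false; the parentheses around each
comparison keep @samp{>} from being taken as output redirection:

@example
BEGIN @{
    print (1.5 <= 2.0)                  # 1
    print ("abc" >= "xyz")              # 0
    print (1.5 != " +2")                # 1
    print ("1e2" < "3")                 # 1
    a = 2; b = "2";   print (a == b)    # 1
    a = 2; b = " +2"; print (a == b)    # 0
@}
@end example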
|
|
|
|
In the next example:
|
|
|
|
@example
|
|
$ echo 1e2 3 | awk '@{ print ($1 < $2) ? "true" : "false" @}'
|
|
@print{} false
|
|
@end example
|
|
|
|
@cindex comparison expressions, string vs. regexp
|
|
@c @cindex string comparison vs. regexp comparison
|
|
@c @cindex regexp comparison vs. string comparison
|
|
@noindent
|
|
the result is @samp{false} because both @code{$1} and @code{$2}
|
|
are user input. They are numeric strings---therefore both have
|
|
the @var{strnum} attribute, dictating a numeric comparison.
|
|
The purpose of the comparison rules and the use of numeric strings is
|
|
to attempt to produce the behavior that is ``least surprising,'' while
|
|
still ``doing the right thing.''
|
|
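When a string comparison of such fields is wanted instead, one common
idiom is to concatenate the null string onto each operand, which makes
both operands strings. The following sketch feeds the same input as the
preceding example through that idiom:

@example
$ echo 1e2 3 | awk '@{ print ($1 "" < $2 "") ? "true" : "false" @}'
@print{} true
@end example
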
String comparisons and regular expression comparisons are very different.
For example:

@example
x == "foo"
@end example

@noindent
has the value one (i.e., it is true) if the variable @code{x}
is precisely @samp{foo}. By contrast:

@example
x ~ /foo/
@end example

@noindent
has the value one if @code{x} contains @samp{foo}, such as
@code{"Oh, what a fool am I!"}.

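The difference can be checked directly. In this small sketch, the
variable names @code{same} and @code{found} are hypothetical and exist
only to hold the two results:

@example
$ awk 'BEGIN @{ x = "Oh, what a fool am I!"
>              same = (x == "foo"); found = (x ~ /foo/)
>              print same, found @}'
@print{} 0 1
@end example
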
@cindex @code{~} (tilde), @code{~} operator
@cindex tilde (@code{~}), @code{~} operator
@cindex @code{!} (exclamation point), @code{!~} operator
@cindex exclamation point (@code{!}), @code{!~} operator
The righthand operand of the @samp{~} and @samp{!~} operators may be
either a regexp constant (@code{/@dots{}/}) or an ordinary
expression. In the latter case, the value of the expression as a string is used as a
dynamic regexp (@pxref{Regexp Usage}; also
@pxref{Computed Regexps}).

@cindex @command{awk}, regexp constants and
@cindex regexp constants
In modern implementations of @command{awk}, a constant regular
expression in slashes by itself is also an expression. The regexp
@code{/@var{regexp}/} is an abbreviation for the following comparison expression:

@example
$0 ~ /@var{regexp}/
@end example

One special place where @code{/foo/} is @emph{not} an abbreviation for
@samp{$0 ~ /foo/} is when it is the righthand operand of @samp{~} or
@samp{!~}.
@xref{Using Constant Regexps},
where this is discussed in more detail.
@c ENDOFRANGE comex
@c ENDOFRANGE excom
@c ENDOFRANGE vartypc
@c ENDOFRANGE varting

@node Boolean Ops
@section Boolean Expressions
@cindex and Boolean-logic operator
@cindex or Boolean-logic operator
@cindex not Boolean-logic operator
@c STARTOFRANGE exbo
@cindex expressions, Boolean
@c STARTOFRANGE boex
@cindex Boolean expressions
@cindex operators, Boolean, See Boolean expressions
@cindex Boolean operators, See Boolean expressions
@cindex logical operators, See Boolean expressions
@cindex operators, logical, See Boolean expressions

A @dfn{Boolean expression} is a combination of comparison expressions or
matching expressions, using the Boolean operators ``or''
(@samp{||}), ``and'' (@samp{&&}), and ``not'' (@samp{!}), along with
parentheses to control nesting. The truth value of the Boolean expression is
computed by combining the truth values of the component expressions.
Boolean expressions are also referred to as @dfn{logical expressions}.
The terms are equivalent.

Boolean expressions can be used wherever comparison and matching
expressions can be used. They can be used in @code{if}, @code{while},
@code{do}, and @code{for} statements
(@pxref{Statements}).
They have numeric values (one if true, zero if false) that come into play
if the result of the Boolean expression is stored in a variable or
used in arithmetic.

In addition, every Boolean expression is also a valid pattern, so
you can use one as a pattern to control the execution of rules.
The Boolean operators are:

@table @code
@item @var{boolean1} && @var{boolean2}
True if both @var{boolean1} and @var{boolean2} are true. For example,
the following statement prints the current input record if it contains
both @samp{2400} and @samp{foo}:

@example
if ($0 ~ /2400/ && $0 ~ /foo/) print
@end example

@cindex side effects, Boolean operators
The subexpression @var{boolean2} is evaluated only if @var{boolean1}
is true. This can make a difference when @var{boolean2} contains
expressions that have side effects. In the case of @samp{$0 ~ /foo/ &&
($2 == bar++)}, the variable @code{bar} is not incremented if there is
no substring @samp{foo} in the record.

@item @var{boolean1} || @var{boolean2}
True if at least one of @var{boolean1} or @var{boolean2} is true.
For example, the following statement prints all records in the input
that contain @emph{either} @samp{2400} or
@samp{foo} or both:

@example
if ($0 ~ /2400/ || $0 ~ /foo/) print
@end example

The subexpression @var{boolean2} is evaluated only if @var{boolean1}
is false. This can make a difference when @var{boolean2} contains
expressions that have side effects.

@item ! @var{boolean}
True if @var{boolean} is false. For example,
the following program prints @samp{no home!} in
the unusual event that the @env{HOME} environment
variable is not defined:

@example
BEGIN @{ if (! ("HOME" in ENVIRON))
           print "no home!" @}
@end example

(The @code{in} operator is described in
@ref{Reference to Elements}.)
@end table

@cindex short-circuit operators
@cindex operators, short-circuit
@cindex @code{&} (ampersand), @code{&&} operator
@cindex ampersand (@code{&}), @code{&&} operator
@cindex @code{|} (vertical bar), @code{||} operator
@cindex vertical bar (@code{|}), @code{||} operator
The @samp{&&} and @samp{||} operators are called @dfn{short-circuit}
operators because of the way they work. Evaluation of the full expression
is ``short-circuited'' if the result can be determined part way through
its evaluation.

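Short-circuiting can be observed through a side effect. In the following
sketch (the variables @code{n}, @code{r}, and @code{s} are introduced
only for this illustration), neither @samp{++n} is ever evaluated, so
@code{n} is still zero at the end:

@example
$ awk 'BEGIN @{ n = 0; r = (0 && ++n); s = (1 || ++n); print r, s, n @}'
@print{} 0 1 0
@end example
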
@cindex line continuations
Statements that use @samp{&&} or @samp{||} can be continued simply
by putting a newline after them. But you cannot put a newline in front
of either of these operators without using backslash continuation
(@pxref{Statements/Lines}).

@cindex @code{!} (exclamation point), @code{!} operator
@cindex exclamation point (@code{!}), @code{!} operator
@cindex newlines
@cindex variables, flag
@cindex flag variables
The actual value of an expression using the @samp{!} operator is
either one or zero, depending upon the truth value of the expression it
is applied to.
The @samp{!} operator is often useful for changing the sense of a flag
variable from false to true and back again. For example, the following
program is one way to print lines in between special bracketing lines:

@example
$1 == "START"   @{ interested = ! interested; next @}
$1 == "END"     @{ interested = ! interested; next @}
interested == 1 @{ print @}
@end example

@noindent
The variable @code{interested}, as with all @command{awk} variables, starts
out initialized to zero, which is also false. When a line is seen whose
first field is @samp{START}, the value of @code{interested} is toggled
to true, using @samp{!}. When a line is seen whose first field is
@samp{END}, @code{interested} is toggled back to false. The last rule
prints lines as long as @code{interested} is true. Both toggling rules
must come before the printing rule; otherwise, the @samp{END} line would
be printed before @code{interested} is toggled back to false.

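As an illustrative sketch, a toggle of this kind can be exercised from
the shell. The flag name @code{on} and the sample input are assumptions
made for this example, and both toggling rules are placed ahead of the
printing rule so that neither bracketing line is printed:

@example
$ printf 'a\nSTART\nb\nc\nEND\nd\n' |
> awk '$1 == "START" @{ on = ! on; next @}
>      $1 == "END"   @{ on = ! on; next @}
>      on            @{ print @}'
@print{} b
@print{} c
@end example
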
@ignore
Scott Deifik points out that this program isn't robust against
bogus input data, but the point is to illustrate the use of `!',
so we'll leave well enough alone.
@end ignore

@cindex @code{next} statement
@strong{Note:} The @code{next} statement is discussed in
@ref{Next Statement}.
@code{next} tells @command{awk} to skip the rest of the rules, get the
next record, and start processing the rules over again at the top.
The reason it's there is to avoid printing the bracketing
@samp{START} and @samp{END} lines.
@c ENDOFRANGE exbo
@c ENDOFRANGE boex

@node Conditional Exp
@section Conditional Expressions
@cindex conditional expressions
@cindex expressions, conditional
@cindex expressions, selecting

A @dfn{conditional expression} is a special kind of expression that has
three operands. It allows you to use one expression's value to select
one of two other expressions.
The conditional expression is the same as in the C language,
as shown here:

@example
@var{selector} ? @var{if-true-exp} : @var{if-false-exp}
@end example

@noindent
There are three subexpressions. The first, @var{selector}, is always
computed first. If it is ``true'' (not zero or not null), then
@var{if-true-exp} is computed next and its value becomes the value of
the whole expression. Otherwise, @var{if-false-exp} is computed next
and its value becomes the value of the whole expression.
For example, the following expression produces the absolute value of @code{x}:

@example
x >= 0 ? x : -x
@end example

@cindex side effects, conditional expressions
Each time the conditional expression is computed, only one of
@var{if-true-exp} and @var{if-false-exp} is used; the other is ignored.
This is important when the expressions have side effects. For example,
this conditional expression examines element @code{i} of either array
@code{a} or array @code{b}, and increments @code{i}:

@example
x == y ? a[i++] : b[i++]
@end example

@noindent
This is guaranteed to increment @code{i} exactly once, because each time
only one of the two increment expressions is executed
and the other is not.
@xref{Arrays},
for more information about arrays.

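The single increment can be confirmed with a short sketch. The array
contents and the values of @code{x} and @code{y} are invented for the
illustration, and @code{v} is a hypothetical variable that holds the
selected element:

@example
$ awk 'BEGIN @{ i = 1; a[1] = "A"; b[1] = "B"; x = 1; y = 2
>              v = (x == y ? a[i++] : b[i++])
>              print v, i @}'
@print{} B 2
@end example
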
@cindex differences in @command{awk} and @command{gawk}, line continuations
@cindex line continuations, @command{gawk}
@cindex @command{gawk}, line continuation in
As a minor @command{gawk} extension,
a statement that uses @samp{?:} can be continued simply
by putting a newline after either character.
However, putting a newline in front
of either character does not work without using backslash continuation
(@pxref{Statements/Lines}).
If @option{--posix} is specified
(@pxref{Options}), then this extension is disabled.

@node Function Calls
@section Function Calls
@cindex function calls

A @dfn{function} is a name for a particular calculation.
This enables you to
ask for it by name at any point in the program. For
example, the function @code{sqrt} computes the square root of a number.

@cindex functions, built-in
A fixed set of functions are @dfn{built-in}, which means they are
available in every @command{awk} program. The @code{sqrt} function is one
of these. @xref{Built-in}, for a list of built-in
functions and their descriptions. In addition, you can define
functions for use in your program.
@xref{User-defined},
for instructions on how to do this.

@cindex arguments, in function calls
The way to use a function is with a @dfn{function call} expression,
which consists of the function name followed immediately by a list of
@dfn{arguments} in parentheses. The arguments are expressions that
provide the raw materials for the function's calculations.
When there is more than one argument, they are separated by commas. If
there are no arguments, just write @samp{()} after the function name.
The following examples show function calls with and without arguments:

@example
sqrt(x^2 + y^2)        @i{one argument}
atan2(y, x)            @i{two arguments}
rand()                 @i{no arguments}
@end example

@cindex troubleshooting, function call syntax
@strong{Caution:}
Do not put any space between the function name and the open-parenthesis!
A user-defined function name looks just like the name of a
variable---a space would make the expression look like concatenation of
a variable with an expression inside parentheses.

With built-in functions, space before the parenthesis is harmless, but
it is best not to get into the habit of using space, so as to avoid mistakes
with user-defined functions. Each function expects a particular number
of arguments. For example, the @code{sqrt} function must be called with
a single argument, the number whose square root is to be taken:

@example
sqrt(@var{argument})
@end example

Some of the built-in functions have one or
more optional arguments.
If those arguments are not supplied, the functions
use a reasonable default value.
@xref{Built-in}, for full details. If arguments
are omitted in calls to user-defined functions, then those arguments are
treated as local variables and initialized to the empty string
(@pxref{User-defined}).

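As a small sketch of an omitted argument, the hypothetical user-defined
function @code{show} below takes two parameters but is called with only
one. The missing parameter @code{b} behaves as the empty string:

@example
$ awk 'function show(a, b) @{ return "[" a "][" b "]" @}
>      BEGIN @{ print show("x") @}'
@print{} [x][]
@end example
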
@cindex side effects, function calls
Like every other expression, the function call has a value, which is
computed by the function based on the arguments you give it. In this
example, the value of @samp{sqrt(@var{argument})} is the square root of
@var{argument}. A function can also have side effects, such as assigning
values to certain variables or doing I/O.
The following program reads numbers, one number per line, and prints the
square root of each one:

@example
$ awk '@{ print "The square root of", $1, "is", sqrt($1) @}'
1
@print{} The square root of 1 is 1
3
@print{} The square root of 3 is 1.73205
5
@print{} The square root of 5 is 2.23607
@kbd{@value{CTL}-d}
@end example

@node Precedence
@section Operator Precedence (How Operators Nest)
@c STARTOFRANGE prec
@cindex precedence
@c STARTOFRANGE oppr
@cindex operators, precedence

@dfn{Operator precedence} determines how operators are grouped when
different operators appear close by in one expression. For example,
@samp{*} has higher precedence than @samp{+}; thus, @samp{a + b * c}
means to multiply @code{b} and @code{c}, and then add @code{a} to the
product (i.e., @samp{a + (b * c)}).

The normal precedence of the operators can be overruled by using parentheses.
Think of the precedence rules as saying where the
parentheses are assumed to be. In
fact, it is wise to always use parentheses whenever there is an unusual
combination of operators, because other people who read the program may
not remember what the precedence is in this case.
Even experienced programmers occasionally forget the exact rules,
which leads to mistakes.
Explicit parentheses help prevent
any such mistakes.

When operators of equal precedence are used together, the leftmost
operator groups first, except for the assignment, conditional, and
exponentiation operators, which group in the opposite order.
Thus, @samp{a - b + c} groups as @samp{(a - b) + c} and
@samp{a = b = c} groups as @samp{a = (b = c)}.

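These grouping rules can be checked with a few constants. The numbers in
this sketch are arbitrary: subtraction and addition group left to right,
exponentiation groups right to left (so @samp{2 ^ 3 ^ 2} is
@samp{2 ^ (3 ^ 2)}), and the chained assignment sets both variables:

@example
$ awk 'BEGIN @{ print 10 - 4 + 3, 2 ^ 3 ^ 2; a = b = 5; print a, b @}'
@print{} 9 512
@print{} 5 5
@end example
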
The precedence of prefix unary operators does not matter as long as only
unary operators are involved, because there is only one way to interpret
them: innermost first. Thus, @samp{$++i} means @samp{$(++i)} and
@samp{++$x} means @samp{++($x)}. However, when another operator follows
the operand, then the precedence of the unary operators can matter.
@samp{$x^2} means @samp{($x)^2}, but @samp{-x^2} means
@samp{-(x^2)}, because @samp{-} has lower precedence than @samp{^},
whereas @samp{$} has higher precedence.
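A brief sketch makes the difference visible. The values of @code{x} and
@code{y} and the one-field input are assumptions chosen for the
illustration: with @code{x} set to one, @samp{$x^2} squares the first
field, whereas @samp{-y^2} negates the square of @code{y}:

@example
$ echo 3 | awk '@{ x = 1; y = 3; print $x^2, -y^2 @}'
@print{} 9 -9
@end example
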
This table presents @command{awk}'s operators, in order of highest
to lowest precedence:

@c use @code in the items, looks better in TeX w/o all the quotes
@table @code
@item (@dots{})
Grouping.

@cindex @code{$} (dollar sign), @code{$} field operator
@cindex dollar sign (@code{$}), @code{$} field operator
@item $
Field.

@cindex @code{+} (plus sign), @code{++} operator
@cindex plus sign (@code{+}), @code{++} operator
@cindex @code{-} (hyphen), @code{--} (decrement/increment) operator
@cindex hyphen (@code{-}), @code{--} (decrement/increment) operators
@item ++ --
Increment, decrement.

@cindex @code{^} (caret), @code{^} operator
@cindex caret (@code{^}), @code{^} operator
@cindex @code{*} (asterisk), @code{**} operator
@cindex asterisk (@code{*}), @code{**} operator
@item ^ **
Exponentiation. These operators group right-to-left.

@cindex @code{+} (plus sign), @code{+} operator
@cindex plus sign (@code{+}), @code{+} operator
@cindex @code{-} (hyphen), @code{-} operator
@cindex hyphen (@code{-}), @code{-} operator
@cindex @code{!} (exclamation point), @code{!} operator
@cindex exclamation point (@code{!}), @code{!} operator
@item + - !
Unary plus, minus, logical ``not.''

@cindex @code{*} (asterisk), @code{*} operator, as multiplication operator
@cindex asterisk (@code{*}), @code{*} operator, as multiplication operator
@cindex @code{/} (forward slash), @code{/} operator
@cindex forward slash (@code{/}), @code{/} operator
@cindex @code{%} (percent sign), @code{%} operator
@cindex percent sign (@code{%}), @code{%} operator
@item * / %
Multiplication, division, modulus.

@cindex @code{+} (plus sign), @code{+} operator
@cindex plus sign (@code{+}), @code{+} operator
@cindex @code{-} (hyphen), @code{-} operator
@cindex hyphen (@code{-}), @code{-} operator
@item + -
Addition, subtraction.

@item @r{String Concatenation}
No special symbol is used to indicate concatenation.
The operands are simply written side by side
(@pxref{Concatenation}).

@cindex @code{<} (left angle bracket), @code{<} operator
@cindex left angle bracket (@code{<}), @code{<} operator
@cindex @code{<} (left angle bracket), @code{<=} operator
@cindex left angle bracket (@code{<}), @code{<=} operator
@cindex @code{>} (right angle bracket), @code{>=} operator
@cindex right angle bracket (@code{>}), @code{>=} operator
@cindex @code{>} (right angle bracket), @code{>} operator
@cindex right angle bracket (@code{>}), @code{>} operator
@cindex @code{=} (equals sign), @code{==} operator
@cindex equals sign (@code{=}), @code{==} operator
@cindex @code{!} (exclamation point), @code{!=} operator
@cindex exclamation point (@code{!}), @code{!=} operator
@cindex @code{>} (right angle bracket), @code{>>} operator (I/O)
@cindex right angle bracket (@code{>}), @code{>>} operator (I/O)
@cindex @code{|} (vertical bar), @code{|} operator (I/O)
@cindex vertical bar (@code{|}), @code{|} operator (I/O)
@cindex @code{|} (vertical bar), @code{|&} operator (I/O)
@cindex vertical bar (@code{|}), @code{|&} operator (I/O)
@cindex operators, input/output
@item < <= == !=
@itemx > >= >> | |&
Relational and redirection.
The relational operators and the redirections have the same precedence
level. Characters such as @samp{>} serve both as relationals and as
redirections; the context distinguishes between the two meanings.

@cindex @code{print} statement, I/O operators in
@cindex @code{printf} statement, I/O operators in
Note that the I/O redirection operators in @code{print} and @code{printf}
statements belong to the statement level, not to expressions. The
redirection does not produce an expression that could be the operand of
another operator. As a result, it does not make sense to use a
redirection operator near another operator of lower precedence without
parentheses. Such combinations (for example, @samp{print foo > a ? b : c})
result in syntax errors.
The correct way to write this statement is @samp{print foo > (a ? b : c)}.

@cindex @code{~} (tilde), @code{~} operator
@cindex tilde (@code{~}), @code{~} operator
@cindex @code{!} (exclamation point), @code{!~} operator
@cindex exclamation point (@code{!}), @code{!~} operator
@item ~ !~
Matching, nonmatching.

@cindex @code{in} operator
@item in
Array membership.

@cindex @code{&} (ampersand), @code{&&} operator
@cindex ampersand (@code{&}), @code{&&} operator
@item &&
Logical ``and.''

@cindex @code{|} (vertical bar), @code{||} operator
@cindex vertical bar (@code{|}), @code{||} operator
@item ||
Logical ``or.''

@cindex @code{?} (question mark), @code{?:} operator
@cindex question mark (@code{?}), @code{?:} operator
@item ?:
Conditional. This operator groups right-to-left.

@cindex @code{+} (plus sign), @code{+=} operator
@cindex plus sign (@code{+}), @code{+=} operator
@cindex @code{-} (hyphen), @code{-=} operator
@cindex hyphen (@code{-}), @code{-=} operator
@cindex @code{*} (asterisk), @code{*=} operator
@cindex asterisk (@code{*}), @code{*=} operator
@cindex @code{*} (asterisk), @code{**=} operator
@cindex asterisk (@code{*}), @code{**=} operator
@cindex @code{/} (forward slash), @code{/=} operator
@cindex forward slash (@code{/}), @code{/=} operator
@cindex @code{%} (percent sign), @code{%=} operator
@cindex percent sign (@code{%}), @code{%=} operator
@cindex @code{^} (caret), @code{^=} operator
@cindex caret (@code{^}), @code{^=} operator
@item = += -= *=
@itemx /= %= ^= **=
Assignment. These operators group right-to-left.
@end table

@cindex portability, operators, not in POSIX @command{awk}
@strong{Note:}
The @samp{|&}, @samp{**}, and @samp{**=} operators are not specified by POSIX.
For maximum portability, do not use them.
@c ENDOFRANGE prec
@c ENDOFRANGE oppr
@c ENDOFRANGE exps

@node Patterns and Actions
@chapter Patterns, Actions, and Variables
@c STARTOFRANGE pat
@cindex patterns

As you have already seen, each @command{awk} statement consists of
a pattern with an associated action. This @value{CHAPTER} describes how
you build patterns and actions, what kinds of things you can do within
actions, and @command{awk}'s built-in variables.

The pattern-action rules and the statements available for use
within actions form the core of @command{awk} programming.
In a sense, everything covered
up to here has been the foundation
that programs are built on top of. Now it's time to start
building something useful.

@menu
* Pattern Overview::            What goes into a pattern.
* Using Shell Variables::       How to use shell variables with @command{awk}.
* Action Overview::             What goes into an action.
* Statements::                  Describes the various control statements in
                                detail.
* Built-in Variables::          Summarizes the built-in variables.
@end menu

@node Pattern Overview
@section Pattern Elements

@menu
* Regexp Patterns::             Using regexps as patterns.
* Expression Patterns::         Any expression can be used as a pattern.
* Ranges::                      Pairs of patterns specify record ranges.
* BEGIN/END::                   Specifying initialization and cleanup rules.
* Empty::                       The empty pattern, which matches every record.
@end menu

@cindex patterns, types of
Patterns in @command{awk} control the execution of rules---a rule is
executed when its pattern matches the current input record.
The following is a summary of the types of @command{awk} patterns:

@table @code
@item /@var{regular expression}/
A regular expression. It matches when the text of the
input record fits the regular expression.
(@xref{Regexp}.)

@item @var{expression}
A single expression. It matches when its value
is nonzero (if a number) or non-null (if a string).
(@xref{Expression Patterns}.)

@item @var{pat1}, @var{pat2}
A pair of patterns separated by a comma, specifying a range of records.
The range includes both the initial record that matches @var{pat1} and
the final record that matches @var{pat2}.
(@xref{Ranges}.)

@item BEGIN
@itemx END
Special patterns for you to supply startup or cleanup actions for your
@command{awk} program.
(@xref{BEGIN/END}.)

@item @var{empty}
The empty pattern matches every input record.
(@xref{Empty}.)
@end table

@node Regexp Patterns
@subsection Regular Expressions as Patterns
@cindex patterns, expressions as
@cindex regular expressions, as patterns

Regular expressions are one of the first kinds of patterns presented
in this book.
This kind of pattern is simply a regexp constant in the pattern part of
a rule. Its meaning is @samp{$0 ~ /@var{pattern}/}.
The pattern matches when the input record matches the regexp.
For example:

@example
/foo|bar|baz/  @{ buzzwords++ @}
END            @{ print buzzwords, "buzzwords seen" @}
@end example

@node Expression Patterns
@subsection Expressions as Patterns
@cindex expressions, as patterns

Any @command{awk} expression is valid as an @command{awk} pattern.
The pattern matches if the expression's value is nonzero (if a
number) or non-null (if a string).
The expression is reevaluated each time the rule is tested against a new
input record. If the expression uses fields such as @code{$1}, the
value depends directly on the new input record's text; otherwise, it
depends on only what has happened so far in the execution of the
@command{awk} program.

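Both cases can be sketched briefly. The sample input here is invented
for the illustration. The first pattern depends on the text of each
record through @code{$1}; the second depends only on how many records
have been read so far, through the built-in variable @code{NR}:

@example
$ printf '5\n12\n7\n20\n' | awk '$1 > 10'
@print{} 12
@print{} 20
$ printf 'a\nb\nc\nd\n' | awk 'NR % 2 == 0'
@print{} b
@print{} d
@end example
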
@cindex comparison expressions, as patterns
@cindex patterns, comparison expressions as
Comparison expressions, using the comparison operators described in
@ref{Typing and Comparison},
are a very common kind of pattern.
Regexp matching and nonmatching are also very common expressions.
The left operand of the @samp{~} and @samp{!~} operators is a string.
The right operand is either a constant regular expression enclosed in
slashes (@code{/@var{regexp}/}), or any expression whose string value
is used as a dynamic regular expression
(@pxref{Computed Regexps}).
The following example prints the second field of each input record
whose first field is precisely @samp{foo}:

@cindex @code{/} (forward slash), patterns and
@cindex forward slash (@code{/}), patterns and
@cindex @code{~} (tilde), @code{~} operator
@cindex tilde (@code{~}), @code{~} operator
@cindex @code{!} (exclamation point), @code{!~} operator
@cindex exclamation point (@code{!}), @code{!~} operator
@example
$ awk '$1 == "foo" @{ print $2 @}' BBS-list
@end example

@noindent
(There is no output, because there is no BBS site with the exact name @samp{foo}.)
Contrast this with the following regular expression match, which
accepts any record with a first field that contains @samp{foo}:

@example
$ awk '$1 ~ /foo/ @{ print $2 @}' BBS-list
@print{} 555-1234
@print{} 555-6699
@print{} 555-6480
@print{} 555-2127
@end example

@cindex regexp constants, as patterns
@cindex patterns, regexp constants as
A regexp constant as a pattern is also a special case of an expression
pattern. The expression @code{/foo/} has the value one if @samp{foo}
appears in the current input record. Thus, as a pattern, @code{/foo/}
matches any record containing @samp{foo}.

@cindex Boolean expressions, as patterns
Boolean expressions are also commonly used as patterns.
Whether the pattern
matches an input record depends on whether its subexpressions match.
For example, the following command prints all the records in
@file{BBS-list} that contain both @samp{2400} and @samp{foo}:

@example
$ awk '/2400/ && /foo/' BBS-list
@print{} fooey 555-1234 2400/1200/300 B
@end example

The following command prints all records in
@file{BBS-list} that contain @emph{either} @samp{2400} or @samp{foo}
(or both, of course):

@example
$ awk '/2400/ || /foo/' BBS-list
@print{} alpo-net 555-3412 2400/1200/300 A
@print{} bites 555-1675 2400/1200/300 A
@print{} fooey 555-1234 2400/1200/300 B
@print{} foot 555-6699 1200/300 B
@print{} macfoo 555-6480 1200/300 A
@print{} sdace 555-3430 2400/1200/300 A
@print{} sabafoo 555-2127 1200/300 C
@end example

The following command prints all records in
@file{BBS-list} that do @emph{not} contain the string @samp{foo}:

@example
$ awk '! /foo/' BBS-list
@print{} aardvark 555-5553 1200/300 B
@print{} alpo-net 555-3412 2400/1200/300 A
@print{} barfly 555-7685 1200/300 A
@print{} bites 555-1675 2400/1200/300 A
@print{} camelot 555-0542 300 C
@print{} core 555-2912 1200/300 C
@print{} sdace 555-3430 2400/1200/300 A
@end example

@cindex @code{BEGIN} pattern, Boolean patterns and
@cindex @code{END} pattern, Boolean patterns and
The subexpressions of a Boolean operator in a pattern can be constant regular
expressions, comparisons, or any other @command{awk} expressions. Range
patterns are not expressions, so they cannot appear inside Boolean
patterns. Likewise, the special patterns @code{BEGIN} and @code{END},
which never match any input record, are not expressions and cannot
appear inside Boolean patterns.

@node Ranges
@subsection Specifying Record Ranges with Patterns

@cindex range patterns
@cindex patterns, ranges in
@cindex lines, matching ranges of
@cindex @code{,} (comma), in range patterns
@cindex comma (@code{,}), in range patterns
A @dfn{range pattern} is made of two patterns separated by a comma, in
the form @samp{@var{begpat}, @var{endpat}}. It is used to match ranges of
consecutive input records. The first pattern, @var{begpat}, controls
where the range begins, while @var{endpat} controls where
the range ends. For example, the following:

@example
awk '$1 == "on", $1 == "off"' myfile
@end example

@noindent
prints every record in @file{myfile} between @samp{on}/@samp{off} pairs, inclusive.

A range pattern starts out by matching @var{begpat} against every
input record. When a record matches @var{begpat}, the range pattern is
@dfn{turned on} and the range pattern matches this record as well. As long as
the range pattern stays turned on, it automatically matches every input
record read. The range pattern also matches @var{endpat} against every
input record; when this succeeds, the range pattern is turned off again
for the following record. Then the range pattern goes back to checking
@var{begpat} against each record.

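As a rough sketch of this on/off behavior (the sample input is invented
for the illustration), the record @samp{c} falls outside any range and
is skipped, while the second @samp{on} turns the range back on until the
end of the input:

@example
$ printf 'a\non\nb\noff\nc\non\nd\n' | awk '$1 == "on", $1 == "off"'
@print{} on
@print{} b
@print{} off
@print{} on
@print{} d
@end example
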
@c last comma does NOT start a tertiary
|
|
@cindex @code{if} statement, actions, changing
|
|
The record that turns on the range pattern and the one that turns it
|
|
off both match the range pattern. If you don't want to operate on
|
|
these records, you can write @code{if} statements in the rule's action
|
|
to distinguish them from the records you are interested in.
|
|
|
|
It is possible for a pattern to be turned on and off by the same
|
|
record. If the record satisfies both conditions, then the action is
|
|
executed for just that record.
|
|
For example, suppose there is text between two identical markers (e.g.,
|
|
the @samp{%} symbol), each on its own line, that should be ignored.
|
|
A first attempt would be to
|
|
combine a range pattern that describes the delimited text with the
|
|
@code{next} statement
|
|
(not discussed yet, @pxref{Next Statement}).
|
|
This causes @command{awk} to skip any further processing of the current
|
|
record and start over again with the next input record. Such a program
|
|
looks like this:
|
|
|
|
@example
|
|
/^%$/,/^%$/ @{ next @}
|
|
@{ print @}
|
|
@end example
|
|
|
|
@noindent
|
|
@cindex lines, skipping between markers
|
|
@c @cindex flag variables
|
|
This program fails because the range pattern is both turned on and turned off
|
|
by the first line, which just has a @samp{%} on it. To accomplish this task,
|
|
write the program in the following manner, using a flag:
|
|
|
|
@cindex @code{!} operator
|
|
@example
|
|
/^%$/ @{ skip = ! skip; next @}
|
|
skip == 1 @{ next @} # skip lines with `skip' set
|
|
@end example
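
The two rules above only decide which records to skip; a complete program
also needs a rule that does something with the remaining records. The
following is a minimal sketch that adds a plain @code{print} rule. The
sample input is invented for illustration:

```shell
# Lines between the two "%" markers (and the markers themselves) are dropped.
skip_out=$(printf '%s\n' keep1 % hidden1 hidden2 % keep2 |
    awk '/^%$/     { skip = ! skip; next }   # toggle the flag on each marker
         skip == 1 { next }                  # inside a marked block: ignore
                   { print }                 # everything else is printed')
printf '%s\n' "$skip_out"
```

Only @samp{keep1} and @samp{keep2} reach the final rule, so those are the
only lines printed.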
|
|
|
|
In a range pattern, the comma (@samp{,}) has the lowest precedence of
|
|
all the operators (i.e., it is evaluated last). Thus, the following
|
|
program attempts to combine a range pattern with another, simpler test:
|
|
|
|
@example
|
|
echo Yes | awk '/1/,/2/ || /Yes/'
|
|
@end example
|
|
|
|
The intent of this program is @samp{(/1/,/2/) || /Yes/}.
|
|
However, @command{awk} interprets this as @samp{/1/, (/2/ || /Yes/)}.
|
|
This cannot be changed or worked around; range patterns do not combine
|
|
with other patterns:
|
|
|
|
@example
|
|
$ echo Yes | gawk '(/1/,/2/) || /Yes/'
|
|
@error{} gawk: cmd. line:1: (/1/,/2/) || /Yes/
|
|
@error{} gawk: cmd. line:1: ^ parse error
|
|
@error{} gawk: cmd. line:2: (/1/,/2/) || /Yes/
|
|
@error{} gawk: cmd. line:2: ^ unexpected newline
|
|
@end example
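
The effect of this precedence rule can be observed with a short sketch.
The five input lines are invented for illustration. Because the pattern is
read as @samp{/1/, (/2/ || /Yes/)}, the first @samp{Yes} line is not
printed, and the second one merely ends the range:

```shell
# The range starts at a line matching /1/ and ends at one matching /2/ or /Yes/.
prec_out=$(printf '%s\n' Yes 1 a Yes b |
    awk '/1/,/2/ || /Yes/')
printf '%s\n' "$prec_out"
```

The output is @samp{1}, @samp{a}, and @samp{Yes}. Under the intended
reading, the leading @samp{Yes} would have been printed as well.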
|
|
|
|
@node BEGIN/END
|
|
@subsection The @code{BEGIN} and @code{END} Special Patterns
|
|
|
|
@c STARTOFRANGE beg
|
|
@cindex @code{BEGIN} pattern
|
|
@c STARTOFRANGE end
|
|
@cindex @code{END} pattern
|
|
All the patterns described so far are for matching input records.
|
|
The @code{BEGIN} and @code{END} special patterns are different.
|
|
They supply startup and cleanup actions for @command{awk} programs.
|
|
@code{BEGIN} and @code{END} rules must have actions; there is no default
|
|
action for these rules because there is no current record when they run.
|
|
@code{BEGIN} and @code{END} rules are often referred to as
|
|
``@code{BEGIN} and @code{END} blocks'' by long-time @command{awk}
|
|
programmers.
|
|
|
|
@menu
|
|
* Using BEGIN/END:: How and why to use BEGIN/END rules.
|
|
* I/O And BEGIN/END:: I/O issues in BEGIN/END rules.
|
|
@end menu
|
|
|
|
@node Using BEGIN/END
|
|
@subsubsection Startup and Cleanup Actions
|
|
|
|
A @code{BEGIN} rule is executed once only, before the first input record
|
|
is read. Likewise, an @code{END} rule is executed once only, after all the
|
|
input is read. For example:
|
|
|
|
@example
|
|
$ awk '
|
|
> BEGIN @{ print "Analysis of \"foo\"" @}
|
|
> /foo/ @{ ++n @}
|
|
> END @{ print "\"foo\" appears", n, "times." @}' BBS-list
|
|
@print{} Analysis of "foo"
|
|
@print{} "foo" appears 4 times.
|
|
@end example
|
|
|
|
@cindex @code{BEGIN} pattern, operators and
|
|
@cindex @code{END} pattern, operators and
|
|
This program finds the number of records in the input file @file{BBS-list}
|
|
that contain the string @samp{foo}. The @code{BEGIN} rule prints a title
|
|
for the report. There is no need to use the @code{BEGIN} rule to
|
|
initialize the counter @code{n} to zero, since @command{awk} does this
|
|
automatically (@pxref{Variables}).
|
|
The second rule increments the variable @code{n} every time a
|
|
record containing the pattern @samp{foo} is read. The @code{END} rule
|
|
prints the value of @code{n} at the end of the run.
|
|
|
|
The special patterns @code{BEGIN} and @code{END} cannot be used in ranges
|
|
or with Boolean operators (indeed, they cannot be used with any operators).
|
|
An @command{awk} program may have multiple @code{BEGIN} and/or @code{END}
|
|
rules. They are executed in the order in which they appear: all the @code{BEGIN}
|
|
rules at startup and all the @code{END} rules at termination.
|
|
@code{BEGIN} and @code{END} rules may be intermixed with other rules.
|
|
This feature was added in the 1987 version of @command{awk} and is included
|
|
in the POSIX standard.
|
|
The original (1978) version of @command{awk}
|
|
required the @code{BEGIN} rule to be placed at the beginning of the
|
|
program, the @code{END} rule to be placed at the end, and only allowed one of
|
|
each.
|
|
This is no longer required, but it is a good idea to follow this template
|
|
in terms of program organization and readability.
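
A small sketch of this ordering follows. The messages are arbitrary, and
@file{/dev/null} serves as an empty input file so that the @code{END}
rules run right away:

```shell
# BEGIN and END rules are deliberately intermixed here.
order_out=$(awk '
BEGIN { print "first BEGIN" }
END   { print "first END" }
BEGIN { print "second BEGIN" }
END   { print "second END" }' /dev/null)
printf '%s\n' "$order_out"
```

Both @code{BEGIN} rules run first, in the order written, followed by both
@code{END} rules, also in the order written.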
|
|
|
|
Multiple @code{BEGIN} and @code{END} rules are useful for writing
|
|
library functions, because each library file can have its own @code{BEGIN} and/or
|
|
@code{END} rule to do its own initialization and/or cleanup.
|
|
The order in which library functions are named on the command line
|
|
controls the order in which their @code{BEGIN} and @code{END} rules are
|
|
executed. Therefore, you have to be careful when writing such rules in
|
|
library files so that the order in which they are executed doesn't matter.
|
|
@xref{Options}, for more information on
|
|
using library functions.
|
|
@xref{Library Functions},
|
|
for a number of useful library functions.
|
|
|
|
If an @command{awk} program has only a @code{BEGIN} rule and no
|
|
other rules, then the program exits after the @code{BEGIN} rule is
|
|
run.@footnote{The original version of @command{awk} used to keep
|
|
reading and ignoring input until the end of the file was seen.} However, if an
|
|
@code{END} rule exists, then the input is read, even if there are
|
|
no other rules in the program. This is necessary in case the @code{END}
|
|
rule checks the @code{FNR} and @code{NR} variables.
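
The difference can be seen in the following sketch, which feeds the same
two-line input (invented for illustration) to a program with only a
@code{BEGIN} rule and to one with only an @code{END} rule:

```shell
# Only a BEGIN rule: awk exits without reading the piped input.
begin_only=$(printf 'a\nb\n' | awk 'BEGIN { print "no input read" }')
# An END rule forces the input to be read, so NR holds the record count.
end_only=$(printf 'a\nb\n' | awk 'END { print NR }')
printf '%s\n' "$begin_only" "$end_only"
```

The second command prints @samp{2}, which is only possible because the
input was read before the @code{END} rule ran.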
|
|
|
|
@node I/O And BEGIN/END
|
|
@subsubsection Input/Output from @code{BEGIN} and @code{END} Rules
|
|
|
|
@cindex input/output, from @code{BEGIN} and @code{END}
|
|
There are several (sometimes subtle) points to remember when doing I/O
|
|
from a @code{BEGIN} or @code{END} rule.
|
|
The first has to do with the value of @code{$0} in a @code{BEGIN}
|
|
rule. Because @code{BEGIN} rules are executed before any input is read,
|
|
there simply is no input record, and therefore no fields, when
|
|
executing @code{BEGIN} rules. References to @code{$0} and the fields
|
|
yield a null string or zero, depending upon the context. One way
|
|
to give @code{$0} a real value is to execute a @code{getline} command
|
|
without a variable (@pxref{Getline}).
|
|
Another way is simply to assign a value to @code{$0}.
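
Both techniques are sketched here; the record contents are arbitrary
examples:

```shell
# Assigning to $0 in BEGIN splits the new record into fields.
assigned=$(awk 'BEGIN { $0 = "alpha beta gamma"; print NF, $2 }')
# A plain getline reads the first input record into $0.
fetched=$(printf 'one two\n' | awk 'BEGIN { getline; print $2 }')
printf '%s\n' "$assigned" "$fetched"
```

The first command prints @samp{3 beta} and the second prints @samp{two},
showing that in both cases the fields are available inside the
@code{BEGIN} rule.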
|
|
|
|
@cindex differences in @command{awk} and @command{gawk}, @code{BEGIN}/@code{END} patterns
|
|
@cindex POSIX @command{awk}, @code{BEGIN}/@code{END} patterns
|
|
@cindex @code{print} statement, @code{BEGIN}/@code{END} patterns and
|
|
@cindex @code{BEGIN} pattern, @code{print} statement and
|
|
@cindex @code{END} pattern, @code{print} statement and
|
|
The second point is similar to the first but from the other direction.
|
|
Traditionally, due largely to implementation issues, @code{$0} and
|
|
@code{NF} were @emph{undefined} inside an @code{END} rule.
|
|
The POSIX standard specifies that @code{NF} is available in an @code{END}
|
|
rule. It contains the number of fields from the last input record.
|
|
Most probably due to an oversight, the standard does not say that @code{$0}
|
|
is also preserved, although logically one would think that it should be.
|
|
In fact, @command{gawk} does preserve the value of @code{$0} for use in
|
|
@code{END} rules. Be aware, however, that Unix @command{awk}, and possibly
|
|
other implementations, do not.
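
For programs that must run under several implementations, one common
approach is to save whatever the @code{END} rule needs in ordinary
variables. The following is a minimal sketch; the variable names
@code{last} and @code{nf} are arbitrary choices, not anything defined by
@command{awk}:

```shell
# Remember the most recent record and its field count for use in END.
last_out=$(printf 'a b\nc d e\n' |
    awk '{ last = $0; nf = NF }
         END { print nf ": " last }')
printf '%s\n' "$last_out"
```

This prints @samp{3: c d e} without relying on @code{$0} or @code{NF}
still being set inside the @code{END} rule.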
|
|
|
|
The third point follows from the first two. The meaning of @samp{print}
|
|
inside a @code{BEGIN} or @code{END} rule is the same as always:
|
|
@samp{print $0}. If @code{$0} is the null string, then this prints an
|
|
empty line. Many long-time @command{awk} programmers use an unadorned
|
|
@samp{print} in @code{BEGIN} and @code{END} rules, to mean @samp{@w{print ""}},
|
|
relying on @code{$0} being null. Although one might generally get away with
|
|
this in @code{BEGIN} rules, it is a very bad idea in @code{END} rules,
|
|
at least in @command{gawk}. It is also poor style, since if an empty
|
|
line is needed in the output, the program should print one explicitly.
|
|
|
|
@cindex @code{next} statement, @code{BEGIN}/@code{END} patterns and
|
|
@cindex @code{nextfile} statement, @code{BEGIN}/@code{END} patterns and
|
|
@cindex @code{BEGIN} pattern, @code{next}/@code{nextfile} statements and
|
|
@cindex @code{END} pattern, @code{next}/@code{nextfile} statements and
|
|
Finally, the @code{next} and @code{nextfile} statements are not allowed
|
|
in a @code{BEGIN} rule, because the implicit
|
|
read-a-record-and-match-against-the-rules loop has not started yet. Similarly, those statements
|
|
are not valid in an @code{END} rule, since all the input has been read.
|
|
(@xref{Next Statement}, and see
|
|
@ref{Nextfile Statement}.)
|
|
@c ENDOFRANGE beg
|
|
@c ENDOFRANGE end
|
|
|
|
@node Empty
|
|
@subsection The Empty Pattern
|
|
|
|
@cindex empty pattern
|
|
@cindex patterns, empty
|
|
An empty (i.e., nonexistent) pattern is considered to match @emph{every}
|
|
input record. For example, the program:
|
|
|
|
@example
|
|
awk '@{ print $1 @}' BBS-list
|
|
@end example
|
|
|
|
@noindent
|
|
prints the first field of every record.
|
|
@c ENDOFRANGE pat
|
|
|
|
@node Using Shell Variables
|
|
@section Using Shell Variables in Programs
|
|
@cindex shells, variables
|
|
@cindex @command{awk} programs, shell variables in
|
|
@c @cindex shell and @command{awk} interaction
|
|
|
|
@command{awk} programs are often used as components in larger
|
|
programs written in shell.
|
|
For example, it is very common to use a shell variable to
|
|
hold a pattern that the @command{awk} program searches for.
|
|
There are two ways to get the value of the shell variable
|
|
into the body of the @command{awk} program.
|
|
|
|
@cindex shells, quoting
|
|
The most common method is to use shell quoting to substitute
|
|
the variable's value into the program inside the script.
|
|
For example, in the following program:
|
|
|
|
@example
|
|
printf "Enter search pattern: "
|
|
read pattern
|
|
awk "/$pattern/ "'@{ nmatches++ @}
|
|
END @{ print nmatches, "found" @}' /path/to/data
|
|
@end example
|
|
|
|
@noindent
|
|
the @command{awk} program consists of two pieces of quoted text
|
|
that are concatenated together to form the program.
|
|
The first part is double-quoted, which allows substitution of
|
|
the @code{pattern} variable inside the quotes.
|
|
The second part is single-quoted.
|
|
|
|
Variable substitution via quoting works, but can be
|
|
messy. It requires a good understanding of the shell's quoting rules
|
|
(@pxref{Quoting}),
|
|
and it's often difficult to correctly
|
|
match up the quotes when reading the program.
|
|
|
|
A better method is to use @command{awk}'s variable assignment feature
|
|
(@pxref{Assignment Options})
|
|
to assign the shell variable's value to an @command{awk} variable's
|
|
value. Then use dynamic regexps to match the pattern
|
|
(@pxref{Computed Regexps}).
|
|
The following shows how to redo the
|
|
previous example using this technique:
|
|
|
|
@example
|
|
printf "Enter search pattern: "
|
|
read pattern
|
|
awk -v pat="$pattern" '$0 ~ pat @{ nmatches++ @}
|
|
END @{ print nmatches, "found" @}' /path/to/data
|
|
@end example
|
|
|
|
@noindent
|
|
Now, the @command{awk} program is just one single-quoted string.
|
|
The assignment @samp{-v pat="$pattern"} still requires double quotes,
|
|
in case there is whitespace in the value of @code{$pattern}.
|
|
The @command{awk} variable @code{pat} could be named @code{pattern}
|
|
too, but that would be more confusing. Using a variable also
|
|
provides more flexibility, since the variable can be used anywhere inside
|
|
the program---for printing, as an array subscript, or for any other
|
|
use---without requiring the quoting tricks at every point in the program.
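
As a sketch of this flexibility, the following uses an assigned variable
for printing rather than for matching. The shell variable @code{label}
and the three-line input are invented for illustration:

```shell
label="total"
# The shell value is passed in with -v and used like any other awk variable.
sum_out=$(printf '1\n2\n3\n' |
    awk -v label="$label" '{ sum += $1 } END { print label ": " sum }')
printf '%s\n' "$sum_out"
```

This prints @samp{total: 6}. The @command{awk} program itself remains a
single single-quoted string, no matter where the variable is used.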
|
|
|
|
@node Action Overview
|
|
@section Actions
|
|
@c @cindex action, definition of
|
|
@c @cindex curly braces
|
|
@c @cindex action, curly braces
|
|
@c @cindex action, separating statements
|
|
@cindex actions
|
|
|
|
An @command{awk} program or script consists of a series of
|
|
rules and function definitions interspersed. (Functions are
|
|
described later. @xref{User-defined}.)
|
|
A rule contains a pattern and an action, either of which (but not
|
|
both) may be omitted. The purpose of the @dfn{action} is to tell
|
|
@command{awk} what to do once a match for the pattern is found. Thus,
|
|
in outline, an @command{awk} program generally looks like this:
|
|
|
|
@example
|
|
@r{[}@var{pattern}@r{]} @r{[}@{ @var{action} @}@r{]}
|
|
@r{[}@var{pattern}@r{]} @r{[}@{ @var{action} @}@r{]}
|
|
@dots{}
|
|
function @var{name}(@var{args}) @{ @dots{} @}
|
|
@dots{}
|
|
@end example
|
|
|
|
@cindex @code{@{@}} (braces), actions and
|
|
@cindex braces (@code{@{@}}), actions and
|
|
@cindex separators, for statements in actions
|
|
@cindex newlines, separating statements in actions
|
|
@cindex @code{;} (semicolon), separating statements in actions
|
|
@cindex semicolon (@code{;}), separating statements in actions
|
|
An action consists of one or more @command{awk} @dfn{statements}, enclosed
|
|
in curly braces (@samp{@{@dots{}@}}). Each statement specifies one
|
|
thing to do. The statements are separated by newlines or semicolons.
|
|
The curly braces around an action must be used even if the action
|
|
contains only one statement, or if it contains no statements at
|
|
all. However, if you omit the action entirely, omit the curly braces as
|
|
well. An omitted action is equivalent to @samp{@{ print $0 @}}:
|
|
|
|
@example
|
|
/foo/ @{ @} @i{match @code{foo}, do nothing --- empty action}
|
|
/foo/ @i{match @code{foo}, print the record --- omitted action}
|
|
@end example
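
The difference between the two forms can be checked with a quick sketch
(the two-line input is invented):

```shell
# Empty action: the record matches, but nothing is done with it.
empty_action=$(printf 'foo\nbar\n' | awk '/foo/ { }')
# Omitted action: the matching record is printed.
omitted_action=$(printf 'foo\nbar\n' | awk '/foo/')
printf 'empty action: [%s]\n' "$empty_action"
printf 'omitted action: [%s]\n' "$omitted_action"
```

The first command produces no output at all, while the second prints
@samp{foo}.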
|
|
|
|
The following types of statements are supported in @command{awk}:
|
|
|
|
@table @asis
|
|
@cindex side effects, statements
|
|
@item Expressions
|
|
Call functions or assign values to variables
|
|
(@pxref{Expressions}). Executing
|
|
this kind of statement simply computes the value of the expression.
|
|
This is useful when the expression has side effects
|
|
(@pxref{Assignment Ops}).
|
|
|
|
@item Control statements
|
|
Specify the control flow of @command{awk}
|
|
programs. The @command{awk} language gives you C-like constructs
|
|
(@code{if}, @code{for}, @code{while}, and @code{do}) as well as a few
|
|
special ones (@pxref{Statements}).
|
|
|
|
@item Compound statements
|
|
Consist of one or more statements enclosed in
|
|
curly braces. A compound statement is used in order to put several
|
|
statements together in the body of an @code{if}, @code{while}, @code{do},
|
|
or @code{for} statement.
|
|
|
|
@item Input statements
|
|
Use the @code{getline} command
|
|
(@pxref{Getline}).
|
|
Also supplied in @command{awk} are the @code{next}
|
|
statement (@pxref{Next Statement}),
|
|
and the @code{nextfile} statement
|
|
(@pxref{Nextfile Statement}).
|
|
|
|
@item Output statements
|
|
Such as @code{print} and @code{printf}.
|
|
@xref{Printing}.
|
|
|
|
@item Deletion statements
|
|
For deleting array elements.
|
|
@xref{Delete}.
|
|
@end table
|
|
|
|
@node Statements
|
|
@section Control Statements in Actions
|
|
@c STARTOFRANGE csta
|
|
@cindex control statements
|
|
@c STARTOFRANGE acs
|
|
@cindex statements, control, in actions
|
|
@c STARTOFRANGE accs
|
|
@cindex actions, control statements in
|
|
|
|
@dfn{Control statements}, such as @code{if}, @code{while}, and so on,
|
|
control the flow of execution in @command{awk} programs. Most of the
|
|
control statements in @command{awk} are patterned on similar statements in C.
|
|
|
|
@c the comma here does NOT start a secondary
|
|
@cindex compound statements, control statements and
|
|
@c the second comma here does NOT start a tertiary
|
|
@cindex statements, compound, control statements and
|
|
@cindex body, in actions
|
|
@cindex @code{@{@}} (braces), statements, grouping
|
|
@cindex braces (@code{@{@}}), statements, grouping
|
|
@cindex newlines, separating statements in actions
|
|
@cindex @code{;} (semicolon), separating statements in actions
|
|
@cindex semicolon (@code{;}), separating statements in actions
|
|
All the control statements start with special keywords, such as @code{if}
|
|
and @code{while}, to distinguish them from simple expressions.
|
|
Many control statements contain other statements. For example, the
|
|
@code{if} statement contains another statement that may or may not be
|
|
executed. The contained statement is called the @dfn{body}.
|
|
To include more than one statement in the body, group them into a
|
|
single @dfn{compound statement} with curly braces, separating them with
|
|
newlines or semicolons.
|
|
|
|
@menu
|
|
* If Statement:: Conditionally execute some @command{awk}
|
|
statements.
|
|
* While Statement:: Loop until some condition is satisfied.
|
|
* Do Statement:: Do specified action while looping until some
|
|
condition is satisfied.
|
|
* For Statement:: Another looping statement, that provides
|
|
initialization and increment clauses.
|
|
* Switch Statement:: Switch/case evaluation for conditional
|
|
execution of statements based on a value.
|
|
* Break Statement:: Immediately exit the innermost enclosing loop.
|
|
* Continue Statement:: Skip to the end of the innermost enclosing
|
|
loop.
|
|
* Next Statement:: Stop processing the current input record.
|
|
* Nextfile Statement:: Stop processing the current file.
|
|
* Exit Statement:: Stop execution of @command{awk}.
|
|
@end menu
|
|
|
|
@node If Statement
|
|
@subsection The @code{if}-@code{else} Statement
|
|
|
|
@cindex @code{if} statement
|
|
The @code{if}-@code{else} statement is @command{awk}'s decision-making
|
|
statement. It looks like this:
|
|
|
|
@example
|
|
if (@var{condition}) @var{then-body} @r{[}else @var{else-body}@r{]}
|
|
@end example
|
|
|
|
@noindent
|
|
The @var{condition} is an expression that controls what the rest of the
|
|
statement does. If the @var{condition} is true, @var{then-body} is
|
|
executed; otherwise, @var{else-body} is executed.
|
|
The @code{else} part of the statement is
|
|
optional. The condition is considered false if its value is zero or
|
|
the null string; otherwise, the condition is true.
|
|
Refer to the following:
|
|
|
|
@example
|
|
if (x % 2 == 0)
|
|
print "x is even"
|
|
else
|
|
print "x is odd"
|
|
@end example
|
|
|
|
In this example, if the expression @samp{x % 2 == 0} is true (that is,
|
|
if the value of @code{x} is evenly divisible by two), then the first
|
|
@code{print} statement is executed; otherwise, the second @code{print}
|
|
statement is executed.
|
|
If the @code{else} keyword appears on the same line as @var{then-body} and
|
|
@var{then-body} is not a compound statement (i.e., not surrounded by
|
|
curly braces), then a semicolon must separate @var{then-body} from
|
|
the @code{else}.
|
|
To illustrate this, the previous example can be rewritten as:
|
|
|
|
@example
|
|
if (x % 2 == 0) print "x is even"; else
|
|
print "x is odd"
|
|
@end example
|
|
|
|
@noindent
|
|
If the @samp{;} is left out, @command{awk} can't interpret the statement and
|
|
it produces a syntax error. Don't actually write programs this way,
|
|
because a human reader might fail to see the @code{else} if it is not
|
|
the first thing on its line.
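
The truth rule given earlier (zero and the null string are false,
anything else is true) can be tried out with a short sketch. The tested
values are arbitrary:

```shell
truth_out=$(awk 'BEGIN {
    if (0)
        print "0 is true"
    else
        print "0 is false"
    if ("")
        print "null string is true"
    else
        print "null string is false"
    if ("x")
        print "non-null string is true"
    else
        print "non-null string is false"
}')
printf '%s\n' "$truth_out"
```

Each @code{else} appears first on its own line, which is the layout
recommended above.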
|
|
|
|
@node While Statement
|
|
@subsection The @code{while} Statement
|
|
@cindex @code{while} statement
|
|
@cindex loops
|
|
@cindex loops, See Also @code{while} statement
|
|
|
|
In programming, a @dfn{loop} is a part of a program that can
|
|
be executed two or more times in succession.
|
|
The @code{while} statement is the simplest looping statement in
|
|
@command{awk}. It repeatedly executes a statement as long as a condition is
|
|
true. It looks like this:
|
|
|
|
@example
|
|
while (@var{condition})
|
|
@var{body}
|
|
@end example
|
|
|
|
@cindex body, in loops
|
|
@noindent
|
|
@var{body} is a statement called the @dfn{body} of the loop,
|
|
and @var{condition} is an expression that controls how long the loop
|
|
keeps running.
|
|
The first thing the @code{while} statement does is test the @var{condition}.
|
|
If the @var{condition} is true, it executes the statement @var{body}.
|
|
@ifinfo
|
|
(The @var{condition} is true when the value
|
|
is not zero and not a null string.)
|
|
@end ifinfo
|
|
After @var{body} has been executed,
|
|
@var{condition} is tested again, and if it is still true, @var{body} is
|
|
executed again. This process repeats until the @var{condition} is no longer
|
|
true. If the @var{condition} is initially false, the body of the loop is
|
|
never executed and @command{awk} continues with the statement following
|
|
the loop.
|
|
This example prints the first three fields of each record, one per line:
|
|
|
|
@example
|
|
awk '@{ i = 1
|
|
while (i <= 3) @{
|
|
print $i
|
|
i++
|
|
@}
|
|
@}' inventory-shipped
|
|
@end example
|
|
|
|
@noindent
|
|
The body of this loop is a compound statement enclosed in braces,
|
|
containing two statements.
|
|
The loop works in the following manner: first, the value of @code{i} is set to one.
|
|
Then, the @code{while} statement tests whether @code{i} is less than or equal to
|
|
three. This is true when @code{i} equals one, so the @code{i}-th
|
|
field is printed. Then the @samp{i++} increments the value of @code{i}
|
|
and the loop repeats. The loop terminates when @code{i} reaches four.
|
|
|
|
A newline is not required between the condition and the
|
|
body; however, using one makes the program clearer unless the body is a
|
|
compound statement or else is very simple. The newline after the open-brace
|
|
that begins the compound statement is not required either, but the
|
|
program is harder to read without it.
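
As a variation on the earlier example, the loop bound can be @code{NF}
instead of a fixed number, so that every field of every record is
printed. This is only a sketch, and the two input records are invented:

```shell
fields_out=$(printf 'a b c\nd e\n' |
    awk '{
        i = 1
        while (i <= NF) {    # NF is the number of fields in this record
            print $i
            i++
        }
    }')
printf '%s\n' "$fields_out"
```

The first record contributes three lines of output and the second
contributes two, because the condition is tested against each record's
own field count.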
|
|
|
|
@node Do Statement
|
|
@subsection The @code{do}-@code{while} Statement
|
|
@cindex @code{do}-@code{while} statement
|
|
|
|
The @code{do} loop is a variation of the @code{while} looping statement.
|
|
The @code{do} loop executes the @var{body} once and then repeats the
|
|
@var{body} as long as the @var{condition} is true. It looks like this:
|
|
|
|
@example
|
|
do
|
|
@var{body}
|
|
while (@var{condition})
|
|
@end example
|
|
|
|
Even if the @var{condition} is false at the start, the @var{body} is
|
|
executed at least once (and only once, unless executing @var{body}
|
|
makes @var{condition} true). Contrast this with the corresponding
|
|
@code{while} statement:
|
|
|
|
@example
|
|
while (@var{condition})
|
|
@var{body}
|
|
@end example
|
|
|
|
@noindent
|
|
This statement does not execute @var{body} even once if the @var{condition}
|
|
is false to begin with.
|
|
The following is an example of a @code{do} statement:
|
|
|
|
@example
|
|
@{ i = 1
|
|
do @{
|
|
print $0
|
|
i++
|
|
@} while (i <= 10)
|
|
@}
|
|
@end example
|
|
|
|
@noindent
|
|
This program prints each input record 10 times. However, it isn't a very
|
|
realistic example, since in this case an ordinary @code{while} would do
|
|
just as well. This situation reflects actual experience; only
|
|
occasionally is there a real use for a @code{do} statement.
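
The one case where the two statements behave differently is when the
condition is false from the start. The following sketch, with arbitrary
starting values, contrasts them:

```shell
do_out=$(awk 'BEGIN {
    i = 10
    do {                        # body runs once before the test
        print "do body, i = " i
        i++
    } while (i < 5)

    j = 10
    while (j < 5) {             # test fails first, body never runs
        print "while body, j = " j
        j++
    }
}')
printf '%s\n' "$do_out"
```

Only the line @samp{do body, i = 10} is printed; the @code{while} loop
produces nothing.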
|
|
|
|
@node For Statement
|
|
@subsection The @code{for} Statement
|
|
@cindex @code{for} statement
|
|
|
|
The @code{for} statement makes it more convenient to count iterations of a
|
|
loop. The general form of the @code{for} statement looks like this:
|
|
|
|
@example
|
|
for (@var{initialization}; @var{condition}; @var{increment})
|
|
@var{body}
|
|
@end example
|
|
|
|
@noindent
|
|
The @var{initialization}, @var{condition}, and @var{increment} parts are
|
|
arbitrary @command{awk} expressions, and @var{body} stands for any
|
|
@command{awk} statement.
|
|
|
|
The @code{for} statement starts by executing @var{initialization}.
|
|
Then, as long
|
|
as the @var{condition} is true, it repeatedly executes @var{body} and then
|
|
@var{increment}. Typically, @var{initialization} sets a variable to
|
|
either zero or one, @var{increment} adds one to it, and @var{condition}
|
|
compares it against the desired number of iterations.
|
|
For example:
|
|
|
|
@example
|
|
awk '@{ for (i = 1; i <= 3; i++)
|
|
print $i
|
|
@}' inventory-shipped
|
|
@end example
|
|
|
|
@noindent
|
|
This prints the first three fields of each input record, with one field per
|
|
line.
|
|
|
|
It isn't possible to
|
|
set more than one variable in the
|
|
@var{initialization} part without using a multiple assignment statement
|
|
such as @samp{x = y = 0}. This makes sense only if all the initial values
|
|
are equal. (But it is possible to initialize additional variables by writing
|
|
their assignments as separate statements preceding the @code{for} loop.)
|
|
|
|
@c @cindex comma operator, not supported
|
|
The same is true of the @var{increment} part. Incrementing additional
|
|
variables requires separate statements at the end of the loop.
|
|
The C compound expression, using C's comma operator, is useful in
|
|
this context but it is not supported in @command{awk}.
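
A sketch of the separate-statement approach follows. The variable names
@code{i} and @code{j} and the step of 10 are arbitrary:

```shell
two_vars=$(awk 'BEGIN {
    j = 0                        # second variable, set before the loop
    for (i = 1; i <= 3; i++) {
        print i, j
        j += 10                  # second increment, at the end of the body
    }
}')
printf '%s\n' "$two_vars"
```

This prints @samp{1 0}, @samp{2 10}, and @samp{3 20}, one pair per line.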
|
|
|
|
Most often, @var{increment} is an increment expression, as in the previous
|
|
example. But this is not required; it can be any expression
|
|
whatsoever. For example, the following statement prints all the powers of two
|
|
between 1 and 100:
|
|
|
|
@example
|
|
for (i = 1; i <= 100; i *= 2)
|
|
print i
|
|
@end example
|
|
|
|
If there is nothing to be done, any of the three expressions in the
|
|
parentheses following the @code{for} keyword may be omitted. Thus,
|
|
@w{@samp{for (; x > 0;)}} is equivalent to @w{@samp{while (x > 0)}}. If the
|
|
@var{condition} is omitted, it is treated as true, effectively
|
|
yielding an @dfn{infinite loop} (i.e., a loop that never terminates).
|
|
|
|
In most cases, a @code{for} loop is an abbreviation for a @code{while}
|
|
loop, as shown here:
|
|
|
|
@example
|
|
@var{initialization}
|
|
while (@var{condition}) @{
|
|
@var{body}
|
|
@var{increment}
|
|
@}
|
|
@end example
|
|
|
|
@cindex loops, @code{continue} statements and
|
|
@noindent
|
|
The only exception is when the @code{continue} statement
|
|
(@pxref{Continue Statement}) is used
|
|
inside the loop. Changing a @code{for} statement to a @code{while}
|
|
statement in this way can change the effect of the @code{continue}
|
|
statement inside the loop.
|
|
|
|
The @command{awk} language has a @code{for} statement in addition to a
|
|
@code{while} statement because a @code{for} loop is often both less work to
|
|
type and more natural to think of. Counting the number of iterations is
|
|
very common in loops. It can be easier to think of this counting as part
|
|
of looping rather than as something to do inside the loop.
|
|
|
|
@ifinfo
|
|
@cindex @code{in} operator
|
|
There is an alternate version of the @code{for} loop, for iterating over
|
|
all the indices of an array:
|
|
|
|
@example
|
|
for (i in array)
|
|
@var{do something with} array[i]
|
|
@end example
|
|
|
|
@noindent
|
|
@xref{Scanning an Array},
|
|
for more information on this version of the @code{for} loop.
|
|
@end ifinfo
|
|
|
|
@node Switch Statement
|
|
@subsection The @code{switch} Statement
|
|
@cindex @code{switch} statement
|
|
@cindex @code{case} keyword
|
|
@cindex @code{default} keyword
|
|
|
|
@strong{NOTE:} This @value{SUBSECTION} describes an experimental feature
|
|
added in @command{gawk} 3.1.3. It is @emph{not} enabled by default. To
|
|
enable it, use the @option{--enable-switch} option to @command{configure}
|
|
when @command{gawk} is being configured and built.
|
|
@xref{Additional Configuration Options},
|
|
for more information.
|
|
|
|
The @code{switch} statement allows the evaluation of an expression and
|
|
the execution of statements based on a @code{case} match. Case statements
|
|
are checked for a match in the order they are defined. If no suitable
|
|
@code{case} is found, the @code{default} section is executed, if supplied. The
|
|
general form of the @code{switch} statement looks like this:
|
|
|
|
@example
|
|
switch (@var{expression}) @{
|
|
case @var{value or regular expression}:
|
|
@var{case-body}
|
|
default:
|
|
@var{default-body}
|
|
@}
|
|
@end example
|
|
|
|
The @code{switch} statement works as it does in C. Once a match to a given
|
|
case is made, case statement bodies are executed until a @code{break},
|
|
@code{continue}, @code{next}, @code{nextfile}, or @code{exit} is encountered,
|
|
or the end of the @code{switch} statement itself. For example:
|
|
|
|
@example
|
|
switch (NR * 2 + 1) @{
|
|
case 3:
|
|
case "11":
|
|
print NR - 1
|
|
break
|
|
|
|
case /2[[:digit:]]+/:
|
|
print NR
|
|
|
|
default:
|
|
print NR + 1
|
|
|
|
case -1:
|
|
print NR * -1
|
|
@}
|
|
@end example
|
|
|
|
Note that if none of the statements specified above halt execution
|
|
of a matched @code{case} statement, execution falls through to the
|
|
next @code{case} until execution halts. In the above example, for
|
|
any case value containing @samp{2} followed by one or more digits,
|
|
the @code{print} statement is executed and then falls through into the
|
|
@code{default} section, executing its @code{print} statement. In turn,
|
|
the @minus{}1 case will also be executed since the @code{default} does
|
|
not halt execution.
|
|
|
|
@node Break Statement
|
|
@subsection The @code{break} Statement
|
|
@cindex @code{break} statement
|
|
@cindex loops, exiting
|
|
|
|
The @code{break} statement jumps out of the innermost @code{for},
|
|
@code{while}, or @code{do} loop that encloses it. The following example
|
|
finds the smallest divisor of any integer, and also identifies prime
|
|
numbers:
|
|
|
|
@example
|
|
# find smallest divisor of num
|
|
@{
|
|
num = $1
|
|
for (div = 2; div*div <= num; div++)
|
|
if (num % div == 0)
|
|
break
|
|
if (num % div == 0)
|
|
printf "Smallest divisor of %d is %d\n", num, div
|
|
else
|
|
printf "%d is prime\n", num
|
|
@}
|
|
@end example
|
|
|
|
When the remainder is zero in the first @code{if} statement, @command{awk}
|
|
immediately @dfn{breaks out} of the containing @code{for} loop. This means
|
|
that @command{awk} proceeds immediately to the statement following the loop
|
|
and continues processing. (This is very different from the @code{exit}
|
|
statement, which stops the entire @command{awk} program.
|
|
@xref{Exit Statement}.)
|
|
|
|
The following program illustrates how the @var{condition} of a @code{for}
|
|
or @code{while} statement could be replaced with a @code{break} inside
|
|
an @code{if}:
|
|
|
|
@example
|
|
# find smallest divisor of num
|
|
@{
|
|
num = $1
|
|
for (div = 2; ; div++) @{
|
|
if (num % div == 0) @{
|
|
printf "Smallest divisor of %d is %d\n", num, div
|
|
break
|
|
@}
|
|
if (div*div > num) @{
|
|
printf "%d is prime\n", num
|
|
break
|
|
@}
|
|
@}
|
|
@}
|
|
@end example
|
|
|
|
@c @cindex @code{break}, outside of loops
|
|
@c @cindex historical features
|
|
@c @cindex @command{awk} language, POSIX version
|
|
@cindex POSIX @command{awk}, @code{break} statement and
|
|
@cindex dark corner, @code{break} statement
|
|
@cindex @command{gawk}, @code{break} statement in
|
|
The @code{break} statement has no meaning when
|
|
used outside the body of a loop. However, although it was never documented,
|
|
historical implementations of @command{awk} treated the @code{break}
|
|
statement outside of a loop as if it were a @code{next} statement
|
|
(@pxref{Next Statement}).
|
|
Recent versions of Unix @command{awk} no longer allow this usage.
|
|
@command{gawk} supports this use of @code{break} only
|
|
if @option{--traditional}
|
|
has been specified on the command line
|
|
(@pxref{Options}).
|
|
Otherwise, it is treated as an error, since the POSIX standard
|
|
specifies that @code{break} should only be used inside the body of a
|
|
loop.
|
|
@value{DARKCORNER}
|
|
|
|
@node Continue Statement
|
|
@subsection The @code{continue} Statement
|
|
|
|
@cindex @code{continue} statement
|
|
As with @code{break}, the @code{continue} statement is used only inside
|
|
@code{for}, @code{while}, and @code{do} loops. It skips
|
|
over the rest of the loop body, causing the next cycle around the loop
|
|
to begin immediately. Contrast this with @code{break}, which jumps out
|
|
of the loop altogether.
|
|
|
|
The @code{continue} statement in a @code{for} loop directs @command{awk} to
|
|
skip the rest of the body of the loop and resume execution with the
|
|
increment-expression of the @code{for} statement. The following program
|
|
illustrates this fact:
|
|
|
|
@example
BEGIN @{
    for (x = 0; x <= 20; x++) @{
        if (x == 5)
            continue
        printf "%d ", x
    @}
    print ""
@}
@end example
|
|
|
|
@noindent
|
|
This program prints all the numbers from 0 to 20---except for 5, for
|
|
which the @code{printf} is skipped. Because the increment @samp{x++}
|
|
is not skipped, @code{x} does not remain stuck at 5. Contrast the
|
|
@code{for} loop from the previous example with the following @code{while} loop:
|
|
|
|
@example
BEGIN @{
    x = 0
    while (x <= 20) @{
        if (x == 5)
            continue
        printf "%d ", x
        x++
    @}
    print ""
@}
@end example
|
|
|
|
@noindent
|
|
This program loops forever once @code{x} reaches 5.
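
One way to repair the loop, offered here only as a sketch, is to perform
the increment before testing for the value to skip, so that
@code{continue} can never bypass it:

@example
BEGIN @{
    x = -1
    while (x < 20) @{
        x++                 # always executed, even when x becomes 5
        if (x == 5)
            continue
        printf "%d ", x
    @}
    print ""
@}
@end example

@noindent
This version prints the numbers from 0 to 20, except for 5, and then
terminates, just as the @code{for} loop does.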
|
|
|
|
@c @cindex @code{continue}, outside of loops
|
|
@c @cindex historical features
|
|
@c @cindex @command{awk} language, POSIX version
|
|
@cindex POSIX @command{awk}, @code{continue} statement and
|
|
@cindex dark corner, @code{continue} statement
|
|
@cindex @command{gawk}, @code{continue} statement in
|
|
The @code{continue} statement has no meaning when used outside the body of
|
|
a loop. Historical versions of @command{awk} treated a @code{continue}
|
|
statement outside a loop the same way they treated a @code{break}
|
|
statement outside a loop: as if it were a @code{next}
|
|
statement
|
|
(@pxref{Next Statement}).
|
|
Recent versions of Unix @command{awk} no longer work this way, and
|
|
@command{gawk} allows it only if @option{--traditional} is specified on
|
|
the command line (@pxref{Options}). Just like the
|
|
@code{break} statement, the POSIX standard specifies that @code{continue}
|
|
should only be used inside the body of a loop.
|
|
@value{DARKCORNER}
|
|
|
|
@node Next Statement
|
|
@subsection The @code{next} Statement
|
|
@cindex @code{next} statement
|
|
|
|
The @code{next} statement forces @command{awk} to immediately stop processing
|
|
the current record and go on to the next record. This means that no
|
|
further rules are executed for the current record, and the rest of the
|
|
current rule's action isn't executed.
|
|
|
|
Contrast this with the effect of the @code{getline} function
|
|
(@pxref{Getline}). That also causes
|
|
@command{awk} to read the next record immediately, but it does not alter the
|
|
flow of control in any way (i.e., the rest of the current action executes
|
|
with a new input record).
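
The difference can be seen in a small sketch; the three-line input and
the messages are made up for illustration. With @code{getline}, the rest
of the first rule's action and the second rule both run with the newly
read record:

@example
$ printf 'a\nb\nc\n' | awk '
> $0 == "b" @{ getline; print "rule 1 continues with", $0 @}
> @{ print "rule 2 sees", $0 @}'
@print{} rule 2 sees a
@print{} rule 1 continues with c
@print{} rule 2 sees c
@end example

@noindent
With @code{next}, nothing further happens for the record @samp{b}:

@example
$ printf 'a\nb\nc\n' | awk '
> $0 == "b" @{ next; print "never reached" @}
> @{ print "rule 2 sees", $0 @}'
@print{} rule 2 sees a
@print{} rule 2 sees c
@end example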
|
|
|
|
@cindex @command{awk} programs, execution of
|
|
At the highest level, @command{awk} program execution is a loop that reads
|
|
an input record and then tests each rule's pattern against it. If you
|
|
think of this loop as a @code{for} statement whose body contains the
|
|
rules, then the @code{next} statement is analogous to a @code{continue}
|
|
statement. It skips to the end of the body of this implicit loop and
|
|
executes the increment (which reads another record).
|
|
|
|
For example, suppose an @command{awk} program works only on records
|
|
with four fields, and it shouldn't fail when given bad input. To avoid
|
|
complicating the rest of the program, write a ``weed out'' rule near
|
|
the beginning, in the following manner:
|
|
|
|
@example
NF != 4 @{
    printf("%s:%d: skipped: NF != 4\n", FILENAME, FNR) > "/dev/stderr"
    next
@}
@end example
|
|
|
|
@noindent
|
|
Because of the @code{next} statement,
|
|
the program's subsequent rules won't see the bad record. The error
|
|
message is redirected to the standard error output stream, as error
|
|
messages should be.
|
|
For more detail, see
@ref{Special Files}.
|
|
|
|
@c @cindex @command{awk} language, POSIX version
|
|
@c @cindex @code{next}, inside a user-defined function
|
|
@cindex @code{BEGIN} pattern, @code{next}/@code{nextfile} statements and
|
|
@cindex @code{END} pattern, @code{next}/@code{nextfile} statements and
|
|
@cindex POSIX @command{awk}, @code{next}/@code{nextfile} statements and
|
|
@cindex @code{next} statement, user-defined functions and
|
|
@cindex functions, user-defined, @code{next}/@code{nextfile} statements and
|
|
According to the POSIX standard, the behavior is undefined if
|
|
the @code{next} statement is used in a @code{BEGIN} or @code{END} rule.
|
|
@command{gawk} treats it as a syntax error.
|
|
Although POSIX permits it,
|
|
some other @command{awk} implementations don't allow the @code{next}
|
|
statement inside function bodies
|
|
(@pxref{User-defined}).
|
|
Just as with any other @code{next} statement, a @code{next} statement inside a
|
|
function body reads the next record and starts processing it with the
|
|
first rule in the program.
|
|
If the @code{next} statement causes the end of the input to be reached,
|
|
then the code in any @code{END} rules is executed.
|
|
@xref{BEGIN/END}.
|
|
|
|
@node Nextfile Statement
|
|
@subsection Using @command{gawk}'s @code{nextfile} Statement
|
|
@cindex @code{nextfile} statement
|
|
@cindex differences in @command{awk} and @command{gawk}, @code{next}/@code{nextfile} statements
|
|
|
|
@command{gawk} provides the @code{nextfile} statement,
|
|
which is similar to the @code{next} statement.
|
|
However, instead of abandoning processing of the current record, the
|
|
@code{nextfile} statement instructs @command{gawk} to stop processing the
|
|
current @value{DF}.
|
|
|
|
The @code{nextfile} statement is a @command{gawk} extension.
|
|
In most other @command{awk} implementations,
|
|
or if @command{gawk} is in compatibility mode
|
|
(@pxref{Options}),
|
|
@code{nextfile} is not special.
|
|
|
|
Upon execution of the @code{nextfile} statement, @code{FILENAME} is
|
|
updated to the name of the next @value{DF} listed on the command line,
|
|
@code{FNR} is reset to one, @code{ARGIND} is incremented, and processing
|
|
starts over with the first rule in the program.
|
|
(@code{ARGIND} hasn't been introduced yet. @xref{Built-in Variables}.)
|
|
If the @code{nextfile} statement causes the end of the input to be reached,
|
|
then the code in any @code{END} rules is executed.
|
|
@xref{BEGIN/END}.
|
|
|
|
The @code{nextfile} statement is useful when there are many @value{DF}s
|
|
to process but it isn't necessary to process every record in every file.
|
|
Normally, in order to move on to the next @value{DF}, a program
|
|
has to continue scanning the unwanted records. The @code{nextfile}
|
|
statement accomplishes this much more efficiently.
|
|
|
|
While one might think that @samp{close(FILENAME)} would accomplish
|
|
the same as @code{nextfile}, this isn't true. @code{close} is
|
|
reserved for closing files, pipes, and coprocesses that are
|
|
opened with redirections. It is not related to the main processing that
|
|
@command{awk} does with the files listed in @code{ARGV}.
|
|
|
|
If it's necessary to use an @command{awk} version that doesn't support
|
|
@code{nextfile}, see
|
|
@ref{Nextfile Function},
|
|
for a user-defined function that simulates the @code{nextfile}
|
|
statement.
|
|
|
|
@cindex functions, user-defined, @code{next}/@code{nextfile} statements and
|
|
@cindex @code{nextfile} statement, user-defined functions and
|
|
The current version of the Bell Laboratories @command{awk}
|
|
(@pxref{Other Versions})
|
|
also supports @code{nextfile}. However, it doesn't allow the @code{nextfile}
|
|
statement inside function bodies
|
|
(@pxref{User-defined}).
|
|
@command{gawk} does; a @code{nextfile} inside a
function body abandons the current @value{DF}, reads the first record of
the next one, and starts processing it with the
first rule in the program, just like any other @code{nextfile} statement.
|
|
|
|
@cindex @code{next file} statement, in @command{gawk}
|
|
@cindex @command{gawk}, @code{next file} statement in
|
|
@cindex @code{nextfile} statement, in @command{gawk}
|
|
@cindex @command{gawk}, @code{nextfile} statement in
|
|
@strong{Caution:} Versions of @command{gawk} prior to 3.0 used two
|
|
words (@samp{next file}) for the @code{nextfile} statement.
|
|
In @value{PVERSION} 3.0, this was changed
|
|
to one word, because the treatment of @samp{file} was
|
|
inconsistent. When it appeared after @code{next}, @samp{file} was a keyword;
|
|
otherwise, it was a regular identifier. The old usage is no longer
|
|
accepted; @samp{next file} generates a syntax error.
|
|
|
|
@node Exit Statement
|
|
@subsection The @code{exit} Statement
|
|
|
|
@cindex @code{exit} statement
|
|
The @code{exit} statement causes @command{awk} to immediately stop
|
|
executing the current rule and to stop processing input; any remaining input
|
|
is ignored. The @code{exit} statement is written as follows:
|
|
|
|
@example
exit @r{[}@var{return code}@r{]}
@end example
|
|
|
|
@cindex @code{BEGIN} pattern, @code{exit} statement and
|
|
@cindex @code{END} pattern, @code{exit} statement and
|
|
When an @code{exit} statement is executed from a @code{BEGIN} rule, the
|
|
program stops processing everything immediately. No input records are
|
|
read. However, if an @code{END} rule is present,
|
|
as part of executing the @code{exit} statement,
|
|
the @code{END} rule is executed
|
|
(@pxref{BEGIN/END}).
|
|
If @code{exit} is used as part of an @code{END} rule, it causes
|
|
the program to stop immediately.
|
|
|
|
An @code{exit} statement that is not part of a @code{BEGIN} or @code{END}
|
|
rule stops the execution of any further automatic rules for the current
|
|
record, skips reading any remaining input records, and executes the
|
|
@code{END} rule if there is one.
|
|
|
|
In such a case,
|
|
if you don't want the @code{END} rule to do its job, set a variable
|
|
to nonzero before the @code{exit} statement and check that variable in
|
|
the @code{END} rule.
|
|
@xref{Assert Function},
|
|
for an example that does this.
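
As a smaller illustration of the same technique, the following sketch
sums the first field of each record and gives up at the first nonnumeric
value. The variable name @code{failed} and the exit status of two are
arbitrary choices made for this example:

@example
$ printf '1\n2\nbad\n4\n' | awk '
> $1 !~ /^[0-9]+$/ @{ failed = 1; exit 2 @}
> @{ sum += $1 @}
> END @{ if (! failed) print "sum is", sum @}'
$ echo $?
@print{} 2
@end example

@noindent
No sum is printed, because the @code{END} rule checks @code{failed}, and
the exit status is the value given to @code{exit}. With entirely numeric
input, the same program prints the sum and exits with status zero.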
|
|
|
|
@cindex dark corner, @code{exit} statement
|
|
If an argument is supplied to @code{exit}, its value is used as the exit
|
|
status code for the @command{awk} process. If no argument is supplied,
|
|
@code{exit} returns status zero (success). In the case where an argument
|
|
is supplied to a first @code{exit} statement, and then @code{exit} is
|
|
called a second time from an @code{END} rule with no argument,
|
|
@command{awk} uses the previously supplied exit value.
|
|
@value{DARKCORNER}
|
|
|
|
@cindex programming conventions, @code{exit} statement
|
|
For example, suppose an error condition occurs that is difficult or
|
|
impossible to handle. Conventionally, programs report this by
|
|
exiting with a nonzero status. An @command{awk} program can do this
|
|
using an @code{exit} statement with a nonzero argument, as shown
|
|
in the following example:
|
|
|
|
@example
BEGIN @{
    if (("date" | getline date_now) <= 0) @{
        print "Can't get system date" > "/dev/stderr"
        exit 1
    @}
    print "current date is", date_now
    close("date")
@}
@end example
|
|
@c ENDOFRANGE csta
|
|
@c ENDOFRANGE acs
|
|
@c ENDOFRANGE accs
|
|
|
|
@node Built-in Variables
|
|
@section Built-in Variables
|
|
@c STARTOFRANGE bvar
|
|
@cindex built-in variables
|
|
@c STARTOFRANGE varb
|
|
@cindex variables, built-in
|
|
|
|
Most @command{awk} variables are available to use for your own
|
|
purposes; they never change unless your program assigns values to
|
|
them, and they never affect anything unless your program examines them.
|
|
However, a few variables in @command{awk} have special built-in meanings.
|
|
@command{awk} examines some of these automatically, so that they enable you
|
|
to tell @command{awk} how to do certain things. Others are set
|
|
automatically by @command{awk}, so that they carry information from the
|
|
internal workings of @command{awk} to your program.
|
|
|
|
@cindex @command{gawk}, built-in variables and
|
|
This @value{SECTION} documents all the built-in variables of
|
|
@command{gawk}, most of which are also documented in the chapters
|
|
describing their areas of activity.
|
|
|
|
@menu
|
|
* User-modified:: Built-in variables that you change to control
|
|
@command{awk}.
|
|
* Auto-set:: Built-in variables where @command{awk} gives
|
|
you information.
|
|
* ARGC and ARGV:: Ways to use @code{ARGC} and @code{ARGV}.
|
|
@end menu
|
|
|
|
@node User-modified
|
|
@subsection Built-in Variables That Control @command{awk}
|
|
@c STARTOFRANGE bvaru
|
|
@cindex built-in variables, user-modifiable
|
|
@c STARTOFRANGE nmbv
|
|
@cindex user-modifiable variables
|
|
|
|
The following is an alphabetical list of variables that you can change to
|
|
control how @command{awk} does certain things. The variables that are
|
|
specific to @command{gawk} are marked with a pound sign@w{ (@samp{#}).}
|
|
|
|
@table @code
|
|
@cindex @code{BINMODE} variable
|
|
@cindex binary input/output
|
|
@cindex input/output, binary
|
|
@item BINMODE #
|
|
On non-POSIX systems, this variable specifies use of binary mode for all I/O.
|
|
Numeric values of one, two, or three specify that input files, output files, or
|
|
all files, respectively, should use binary I/O.
|
|
Alternatively,
|
|
string values of @code{"r"} or @code{"w"} specify that input files and
|
|
output files, respectively, should use binary I/O.
|
|
A string value of @code{"rw"} or @code{"wr"} indicates that all
|
|
files should use binary I/O.
|
|
Any other string value is equivalent to @code{"rw"}, but @command{gawk}
|
|
generates a warning message.
|
|
@code{BINMODE} is described in more detail in
|
|
@ref{PC Using}.
|
|
|
|
@cindex differences in @command{awk} and @command{gawk}, @code{BINMODE} variable
|
|
This variable is a @command{gawk} extension.
|
|
In other @command{awk} implementations
|
|
(except @command{mawk},
|
|
@pxref{Other Versions}),
|
|
or if @command{gawk} is in compatibility mode
|
|
(@pxref{Options}),
|
|
it is not special.
|
|
|
|
@cindex @code{CONVFMT} variable
|
|
@cindex POSIX @command{awk}, @code{CONVFMT} variable and
|
|
@cindex numbers, converting, to strings
|
|
@cindex strings, converting, numbers to
|
|
@item CONVFMT
|
|
This string controls conversion of numbers to
|
|
strings (@pxref{Conversion}).
|
|
It works by being passed, in effect, as the first argument to the
|
|
@code{sprintf} function
|
|
(@pxref{String Functions}).
|
|
Its default value is @code{"%.6g"}.
|
|
@code{CONVFMT} was introduced by the POSIX standard.
|
|
|
|
@cindex @code{FIELDWIDTHS} variable
|
|
@cindex differences in @command{awk} and @command{gawk}, @code{FIELDWIDTHS} variable
|
|
@cindex field separators, @code{FIELDWIDTHS} variable and
|
|
@cindex separators, field, @code{FIELDWIDTHS} variable and
|
|
@item FIELDWIDTHS #
|
|
This is a space-separated list of column widths that tells @command{gawk}
how to split input with fixed columnar boundaries.
|
|
Assigning a value to @code{FIELDWIDTHS}
|
|
overrides the use of @code{FS} for field splitting.
|
|
@xref{Constant Size}, for more information.
|
|
|
|
@cindex @command{gawk}, @code{FIELDWIDTHS} variable in
|
|
If @command{gawk} is in compatibility mode
|
|
(@pxref{Options}), then @code{FIELDWIDTHS}
|
|
has no special meaning, and field-splitting operations occur based
|
|
exclusively on the value of @code{FS}.
|
|
|
|
@cindex @code{FS} variable
|
|
@cindex separators, field
|
|
@cindex field separators
|
|
@item FS
|
|
This is the input field separator
|
|
(@pxref{Field Separators}).
|
|
The value is a single-character string or a multi-character regular
|
|
expression that matches the separations between fields in an input
|
|
record. If the value is the null string (@code{""}), then each
|
|
character in the record becomes a separate field.
|
|
(This behavior is a @command{gawk} extension. POSIX @command{awk} does not
|
|
specify the behavior when @code{FS} is the null string.)
|
|
@c NEXT ED: Mark as common extension
|
|
|
|
@cindex POSIX @command{awk}, @code{FS} variable and
|
|
The default value is @w{@code{" "}}, a string consisting of a single
|
|
space. As a special exception, this value means that any
|
|
sequence of spaces, tabs, and/or newlines is a single separator.@footnote{In
|
|
POSIX @command{awk}, newline does not count as whitespace.} It also causes
|
|
spaces, tabs, and newlines at the beginning and end of a record to be ignored.
|
|
|
|
You can set the value of @code{FS} on the command line using the
|
|
@option{-F} option:
|
|
|
|
@example
awk -F, '@var{program}' @var{input-files}
@end example
|
|
|
|
@cindex @command{gawk}, field separators and
|
|
If @command{gawk} is using @code{FIELDWIDTHS} for field splitting,
|
|
assigning a value to @code{FS} causes @command{gawk} to return to
|
|
the normal, @code{FS}-based field splitting. An easy way to do this
|
|
is to simply say @samp{FS = FS}, perhaps with an explanatory comment.
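
The following sketch, using made-up input, shows three kinds of value for
@code{FS}: the default single space, which also discards leading and
trailing whitespace; a regular expression; and a single character, where
adjacent separators delimit an empty field:

@example
$ echo '  alpha   beta  ' | awk '@{ print NF ": " $1 "," $2 @}'
@print{} 2: alpha,beta
$ echo 'alpha:beta;gamma' | awk -F'[:;]' '@{ print NF ": " $3 @}'
@print{} 3: gamma
$ echo 'a,b,,d' | awk -F, '@{ print NF @}'
@print{} 4
@end example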
|
|
|
|
@cindex @code{IGNORECASE} variable
|
|
@cindex differences in @command{awk} and @command{gawk}, @code{IGNORECASE} variable
|
|
@cindex case sensitivity, string comparisons and
|
|
@cindex case sensitivity, regexps and
|
|
@cindex regular expressions, case sensitivity
|
|
@item IGNORECASE #
|
|
If @code{IGNORECASE} is nonzero or non-null, then all string comparisons
|
|
and all regular expression matching are case independent. Thus, regexp
|
|
matching with @samp{~} and @samp{!~}, as well as the @code{gensub},
|
|
@code{gsub}, @code{index}, @code{match}, @code{split}, and @code{sub}
|
|
functions, record termination with @code{RS}, and field splitting with
|
|
@code{FS}, all ignore case when doing their particular regexp operations.
|
|
However, the value of @code{IGNORECASE} does @emph{not} affect array subscripting
|
|
and it does not affect field splitting when using a single-character
|
|
field separator.
|
|
@xref{Case-sensitivity}.
|
|
|
|
@cindex @command{gawk}, @code{IGNORECASE} variable in
|
|
If @command{gawk} is in compatibility mode
|
|
(@pxref{Options}),
|
|
then @code{IGNORECASE} has no special meaning. Thus, string
|
|
and regexp operations are always case-sensitive.
|
|
|
|
@cindex @code{LINT} variable
|
|
@cindex differences in @command{awk} and @command{gawk}, @code{LINT} variable
|
|
@cindex lint checking
|
|
@item LINT #
|
|
When this variable is true (nonzero or non-null), @command{gawk}
behaves as if the @option{--lint} command-line option is in effect
(@pxref{Options}).
|
|
With a value of @code{"fatal"}, lint warnings become fatal errors.
|
|
With a value of @code{"invalid"}, only warnings about things that are
|
|
actually invalid are issued. (This is not fully implemented yet.)
|
|
Any other true value prints nonfatal warnings.
|
|
Assigning a false value to @code{LINT} turns off the lint warnings.
|
|
|
|
@cindex @command{gawk}, @code{LINT} variable in
|
|
This variable is a @command{gawk} extension. It is not special
|
|
in other @command{awk} implementations. Unlike the other special variables,
|
|
changing @code{LINT} does affect the production of lint warnings,
|
|
even if @command{gawk} is in compatibility mode. Much as
|
|
the @option{--lint} and @option{--traditional} options independently
|
|
control different aspects of @command{gawk}'s behavior, the control
|
|
of lint warnings during program execution is independent of the flavor
|
|
of @command{awk} being executed.
|
|
|
|
@cindex @code{OFMT} variable
|
|
@cindex numbers, converting, to strings
|
|
@cindex strings, converting, numbers to
|
|
@item OFMT
|
|
This string controls conversion of numbers to
|
|
strings (@pxref{Conversion}) for
|
|
printing with the @code{print} statement. It works by being passed
|
|
as the first argument to the @code{sprintf} function
|
|
(@pxref{String Functions}).
|
|
Its default value is @code{"%.6g"}. Earlier versions of @command{awk}
|
|
also used @code{OFMT} to specify the format for converting numbers to
|
|
strings in general expressions; this is now done by @code{CONVFMT}.
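
The division of labor between @code{OFMT} and @code{CONVFMT} can be seen
in the following sketch; the value of @code{x} and the formats are
arbitrary choices made for illustration:

@example
$ awk 'BEGIN @{
>     x = 3.14159265
>     OFMT = "%.2f"
>     print x          # a number printed directly: OFMT applies
>     print x ""       # concatenation converts first: CONVFMT applies
>     CONVFMT = "%.3f"
>     print x ""
> @}'
@print{} 3.14
@print{} 3.14159
@print{} 3.142
@end example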
|
|
|
|
@cindex @code{sprintf} function, @code{OFMT} variable and
|
|
@cindex @code{print} statement, @code{OFMT} variable and
|
|
@cindex @code{OFS} variable
|
|
@cindex separators, field
|
|
@cindex field separators
|
|
@item OFS
|
|
This is the output field separator (@pxref{Output Separators}). It is
|
|
output between the fields printed by a @code{print} statement. Its
|
|
default value is @w{@code{" "}}, a string consisting of a single space.
|
|
|
|
@cindex @code{ORS} variable
|
|
@item ORS
|
|
This is the output record separator. It is output at the end of every
|
|
@code{print} statement. Its default value is @code{"\n"}, the newline
|
|
character. (@xref{Output Separators}.)
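
As a brief sketch with made-up values, the following program sets both
output separators. Note that @code{OFS} appears in @code{$0} only after
the record has been rebuilt, here by assigning a field to itself:

@example
$ echo 'a b c' | awk 'BEGIN @{ OFS = "-"; ORS = "!\n" @}
> @{ print $1, $2; $1 = $1; print @}'
@print{} a-b!
@print{} a-b-c!
@end example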
|
|
|
|
@cindex @code{RS} variable
|
|
@cindex separators, record
|
|
@cindex record separators
|
|
@item RS
|
|
This is @command{awk}'s input record separator. Its default value is a string
|
|
containing a single newline character, which means that an input record
|
|
consists of a single line of text.
|
|
It can also be the null string, in which case records are separated by
|
|
runs of blank lines.
|
|
If it is a regexp, records are separated by
|
|
matches of the regexp in the input text.
|
|
(@xref{Records}.)
|
|
|
|
The ability for @code{RS} to be a regular expression
|
|
is a @command{gawk} extension.
|
|
In most other @command{awk} implementations,
|
|
or if @command{gawk} is in compatibility mode
|
|
(@pxref{Options}),
|
|
just the first character of @code{RS}'s value is used.
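
The two portable cases can be sketched as follows, using made-up input.
With a null @code{RS}, the run of blank lines separates two records, and
the newline inside the first record also separates its fields; with a
single-character @code{RS}, each @samp{;} ends a record:

@example
$ printf 'a\nb\n\n\nc\n' | awk 'BEGIN @{ RS = "" @} @{ print NR, NF @}'
@print{} 1 2
@print{} 2 1
$ printf 'x;y;z' | awk 'BEGIN @{ RS = ";" @} @{ print NR ": " $0 @}'
@print{} 1: x
@print{} 2: y
@print{} 3: z
@end example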
|
|
|
|
@cindex @code{SUBSEP} variable
|
|
@cindex separators, subscript
|
|
@cindex subscript separators
|
|
@item SUBSEP
|
|
This is the subscript separator. It has the default value of
|
|
@code{"\034"} and is used to separate the parts of the indices of a
|
|
multidimensional array. Thus, the expression @code{@w{foo["A", "B"]}}
|
|
really accesses @code{foo["A\034B"]}
|
|
(@pxref{Multi-dimensional}).
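
The following sketch makes the equivalence visible. It stores one
element with a two-part subscript, splits the combined index back apart
with @code{SUBSEP}, and then tests for the same element using an
explicitly concatenated index. The array names are arbitrary:

@example
$ awk 'BEGIN @{
>     foo["A", "B"] = 1
>     for (k in foo) @{
>         split(k, part, SUBSEP)
>         print part[1], part[2]
>     @}
>     if (("A" SUBSEP "B") in foo)
>         print "same element"
> @}'
@print{} A B
@print{} same element
@end example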
|
|
|
|
@cindex @code{TEXTDOMAIN} variable
|
|
@cindex differences in @command{awk} and @command{gawk}, @code{TEXTDOMAIN} variable
|
|
@cindex internationalization, localization
|
|
@item TEXTDOMAIN #
|
|
This variable is used for internationalization of programs at the
|
|
@command{awk} level. It sets the default text domain for specially
|
|
marked string constants in the source text, as well as for the
|
|
@code{dcgettext}, @code{dcngettext} and @code{bindtextdomain} functions
|
|
(@pxref{Internationalization}).
|
|
The default value of @code{TEXTDOMAIN} is @code{"messages"}.
|
|
|
|
This variable is a @command{gawk} extension.
|
|
In other @command{awk} implementations,
|
|
or if @command{gawk} is in compatibility mode
|
|
(@pxref{Options}),
|
|
it is not special.
|
|
@end table
|
|
@c ENDOFRANGE bvar
|
|
@c ENDOFRANGE varb
|
|
@c ENDOFRANGE bvaru
|
|
@c ENDOFRANGE nmbv
|
|
|
|
@node Auto-set
|
|
@subsection Built-in Variables That Convey Information
|
|
|
|
@c STARTOFRANGE bvconi
|
|
@cindex built-in variables, conveying information
|
|
@c STARTOFRANGE vbconi
|
|
@cindex variables, built-in, conveying information
|
|
The following is an alphabetical list of variables that @command{awk}
|
|
sets automatically on certain occasions in order to provide
|
|
information to your program. The variables that are specific to
|
|
@command{gawk} are marked with a pound sign@w{ (@samp{#}).}
|
|
|
|
@table @code
|
|
@cindex @code{ARGC}/@code{ARGV} variables
|
|
@cindex arguments, command-line
|
|
@cindex command line, arguments
|
|
@item ARGC@r{,} ARGV
|
|
The command-line arguments available to @command{awk} programs are stored in
|
|
an array called @code{ARGV}. @code{ARGC} is the number of command-line
|
|
arguments present. @xref{Other Arguments}.
|
|
Unlike most @command{awk} arrays,
|
|
@code{ARGV} is indexed from 0 to @code{ARGC} @minus{} 1.
|
|
In the following example:
|
|
|
|
@example
$ awk 'BEGIN @{
>     for (i = 0; i < ARGC; i++)
>         print ARGV[i]
> @}' inventory-shipped BBS-list
@print{} awk
@print{} inventory-shipped
@print{} BBS-list
@end example
|
|
|
|
@noindent
|
|
@code{ARGV[0]} contains @code{"awk"}, @code{ARGV[1]}
|
|
contains @code{"inventory-shipped"}, and @code{ARGV[2]} contains
|
|
@code{"BBS-list"}. The value of @code{ARGC} is three, one more than the
|
|
index of the last element in @code{ARGV}, because the elements are numbered
|
|
from zero.
|
|
|
|
@cindex programming conventions, @code{ARGC}/@code{ARGV} variables
|
|
The names @code{ARGC} and @code{ARGV}, as well as the convention of indexing
|
|
the array from 0 to @code{ARGC} @minus{} 1, are derived from the C language's
|
|
method of accessing command-line arguments.
|
|
|
|
The value of @code{ARGV[0]} can vary from system to system.
|
|
Also, you should note that the program text is @emph{not} included in
|
|
@code{ARGV}, nor are any of @command{awk}'s command-line options.
|
|
@xref{ARGC and ARGV}, for information
|
|
about how @command{awk} uses these variables.
|
|
|
|
@cindex @code{ARGIND} variable
|
|
@cindex differences in @command{awk} and @command{gawk}, @code{ARGIND} variable
|
|
@item ARGIND #
|
|
The index in @code{ARGV} of the current file being processed.
|
|
Every time @command{gawk} opens a new @value{DF} for processing, it sets
|
|
@code{ARGIND} to the index in @code{ARGV} of the @value{FN}.
|
|
When @command{gawk} is processing the input files,
|
|
@samp{FILENAME == ARGV[ARGIND]} is always true.
|
|
|
|
@c comma before ARGIND does NOT mark a tertiary
|
|
@cindex files, processing, @code{ARGIND} variable and
|
|
This variable is useful in file processing; it allows you to tell how far
|
|
along you are in the list of @value{DF}s as well as to distinguish between
|
|
successive instances of the same @value{FN} on the command line.
|
|
|
|
@cindex @value{FN}s, distinguishing
|
|
While you can change the value of @code{ARGIND} within your @command{awk}
|
|
program, @command{gawk} automatically sets it to a new value when the
|
|
next file is opened.
|
|
|
|
This variable is a @command{gawk} extension.
|
|
In other @command{awk} implementations,
|
|
or if @command{gawk} is in compatibility mode
|
|
(@pxref{Options}),
|
|
it is not special.
|
|
|
|
@cindex @code{ENVIRON} variable
|
|
@cindex environment variables
|
|
@item ENVIRON
|
|
An associative array that contains the values of the environment. The array
|
|
indices are the environment variable names; the elements are the values of
|
|
the particular environment variables. For example,
|
|
@code{ENVIRON["HOME"]} might be @file{/home/arnold}. Changing this array
|
|
does not affect the environment passed on to any programs that
|
|
@command{awk} may spawn via redirection or the @code{system} function.
|
|
@c (In a future version of @command{gawk}, it may do so.)
|
|
|
|
Some operating systems may not have environment variables.
|
|
On such systems, the @code{ENVIRON} array is empty (except for
|
|
@w{@code{ENVIRON["AWKPATH"]}},
|
|
@pxref{AWKPATH Variable}).
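
Because a variable may be absent from the environment, it can be safer
to test for it with the @code{in} operator before using it. The
following sketch assumes a hypothetical environment variable named
@env{GREETING}, set here just for the one command:

@example
$ GREETING=hello awk 'BEGIN @{
>     if ("GREETING" in ENVIRON)
>         print ENVIRON["GREETING"]
> @}'
@print{} hello
@end example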
|
|
|
|
@cindex @code{ERRNO} variable
|
|
@cindex differences in @command{awk} and @command{gawk}, @code{ERRNO} variable
|
|
@cindex error handling, @code{ERRNO} variable and
|
|
@item ERRNO #
|
|
If a system error occurs during a redirection for @code{getline},
|
|
during a read for @code{getline}, or during a @code{close} operation,
|
|
then @code{ERRNO} contains a string describing the error.
|
|
|
|
This variable is a @command{gawk} extension.
|
|
In other @command{awk} implementations,
|
|
or if @command{gawk} is in compatibility mode
|
|
(@pxref{Options}),
|
|
it is not special.
|
|
|
|
@cindex @code{FILENAME} variable
|
|
@cindex dark corner, @code{FILENAME} variable
|
|
@item FILENAME
|
|
The name of the file that @command{awk} is currently reading.
|
|
When no @value{DF}s are listed on the command line, @command{awk} reads
|
|
from the standard input and @code{FILENAME} is set to @code{"-"}.
|
|
@code{FILENAME} is changed each time a new file is read
|
|
(@pxref{Reading Files}).
|
|
Inside a @code{BEGIN} rule, the value of @code{FILENAME} is
|
|
@code{""}, since there are no input files being processed
|
|
yet.@footnote{Some early implementations of Unix @command{awk} initialized
|
|
@code{FILENAME} to @code{"-"}, even if there were @value{DF}s to be
|
|
processed. This behavior was incorrect and should not be relied
|
|
upon in your programs.}
|
|
@value{DARKCORNER}
|
|
Note, though, that using @code{getline}
|
|
(@pxref{Getline})
|
|
inside a @code{BEGIN} rule can give
|
|
@code{FILENAME} a value.
|
|
|
|
@cindex @code{FNR} variable
|
|
@item FNR
|
|
The current record number in the current file. @code{FNR} is
|
|
incremented each time a new record is read
|
|
(@pxref{Getline}). It is reinitialized
|
|
to zero each time a new input file is started.
|
|
|
|
@cindex @code{NF} variable
|
|
@item NF
|
|
The number of fields in the current input record.
|
|
@code{NF} is set each time a new record is read, when a new field is
created, or when @code{$0} changes (@pxref{Fields}).
|
|
|
|
Unlike most of the variables described in this
|
|
@ifnotinfo
|
|
section,
|
|
@end ifnotinfo
|
|
@ifinfo
|
|
node,
|
|
@end ifinfo
|
|
assigning a value to @code{NF} has the potential to affect
|
|
@command{awk}'s internal workings. In particular, assignments
|
|
to @code{NF} can be used to create or remove fields from the
|
|
current record. @xref{Changing Fields}.
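
As a short sketch with made-up input, decreasing @code{NF} discards the
trailing fields, and increasing it creates empty ones. In both cases
@code{$0} is rebuilt using @code{OFS}:

@example
$ echo 'a b c d' | awk '@{ NF = 2; print; print NF @}'
@print{} a b
@print{} 2
$ echo 'a b' | awk 'BEGIN @{ OFS = ":" @} @{ NF = 4; $4 = "d"; print @}'
@print{} a:b::d
@end example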
|
|
|
|
@cindex @code{NR} variable
|
|
@item NR
|
|
The number of input records @command{awk} has processed since
|
|
the beginning of the program's execution
|
|
(@pxref{Records}).
|
|
@code{NR} is incremented each time a new record is read.
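
The difference between @code{NR} and @code{FNR} shows up as soon as more
than one file is read. The following sketch assumes two hypothetical
@value{DF}s, @file{first.txt} containing two lines and @file{second.txt}
containing one:

@example
$ awk '@{ print FILENAME, NR, FNR @}' first.txt second.txt
@print{} first.txt 1 1
@print{} first.txt 2 2
@print{} second.txt 3 1
@end example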
|
|
|
|
@cindex @code{PROCINFO} array
|
|
@cindex differences in @command{awk} and @command{gawk}, @code{PROCINFO} array
|
|
@item PROCINFO #
|
|
The elements of this array provide access to information about the
|
|
running @command{awk} program.
|
|
The following elements (listed alphabetically)
|
|
are guaranteed to be available:
|
|
|
|
@table @code
|
|
@item PROCINFO["egid"]
|
|
The value of the @code{getegid} system call.
|
|
|
|
@item PROCINFO["euid"]
|
|
The value of the @code{geteuid} system call.
|
|
|
|
@item PROCINFO["FS"]
|
|
This is
|
|
@code{"FS"} if field splitting with @code{FS} is in effect, or it is
|
|
@code{"FIELDWIDTHS"} if field splitting with @code{FIELDWIDTHS} is in effect.
|
|
|
|
@item PROCINFO["gid"]
|
|
The value of the @code{getgid} system call.
|
|
|
|
@item PROCINFO["pgrpid"]
|
|
The process group ID of the current process.
|
|
|
|
@item PROCINFO["pid"]
|
|
The process ID of the current process.
|
|
|
|
@item PROCINFO["ppid"]
|
|
The parent process ID of the current process.
|
|
|
|
@item PROCINFO["uid"]
|
|
The value of the @code{getuid} system call.
|
|
@end table
|
|
|
|
On some systems, there may be elements in the array, @code{"group1"}
|
|
through @code{"group@var{N}"} for some @var{N}. @var{N} is the number of
|
|
supplementary groups that the process has. Use the @code{in} operator
|
|
to test for these elements
|
|
(@pxref{Reference to Elements}).
|
|
|
|
This array is a @command{gawk} extension.
|
|
In other @command{awk} implementations,
|
|
or if @command{gawk} is in compatibility mode
|
|
(@pxref{Options}),
|
|
it is not special.
|
|
|
|
@cindex @code{RLENGTH} variable
|
|
@item RLENGTH
|
|
The length of the substring matched by the
|
|
@code{match} function
|
|
(@pxref{String Functions}).
|
|
@code{RLENGTH} is set by invoking the @code{match} function. Its value
|
|
is the length of the matched string, or @minus{}1 if no match is found.
|
|
|
|
@cindex @code{RSTART} variable
|
|
@item RSTART
|
|
The start-index in characters of the substring that is matched by the
|
|
@code{match} function
|
|
(@pxref{String Functions}).
|
|
@code{RSTART} is set by invoking the @code{match} function. Its value
|
|
is the position of the string where the matched substring starts, or zero
|
|
if no match was found.
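
A small sketch, with an arbitrary string and regexps, shows how
@code{RSTART} and @code{RLENGTH} combine with @code{substr} to extract
the matched text, and what they hold after a failed match:

@example
$ awk 'BEGIN @{
>     s = "foobarbaz"
>     if (match(s, /ba[rz]/))
>         print RSTART, RLENGTH, substr(s, RSTART, RLENGTH)
>     match(s, /xyz/)
>     print RSTART, RLENGTH
> @}'
@print{} 4 3 bar
@print{} 0 -1
@end example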
|
|
|
|
@cindex @code{RT} variable
|
|
@cindex differences in @command{awk} and @command{gawk}, @code{RT} variable
|
|
@item RT #
|
|
This is set each time a record is read. It contains the input text
|
|
that matched the text denoted by @code{RS}, the record separator.
|
|
|
|
This variable is a @command{gawk} extension.
|
|
In other @command{awk} implementations,
|
|
or if @command{gawk} is in compatibility mode
|
|
(@pxref{Options}),
|
|
it is not special.
|
|
@end table
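As a brief illustration of @code{RSTART} and @code{RLENGTH}, the following sketch extracts the first run of digits from a string. The variable name @code{s} and the sample text are invented for the example; only @code{match}, @code{substr}, @code{RSTART}, and @code{RLENGTH} come from the descriptions above:

```shell
# Extract the first run of digits from a string with match().
result=$(awk 'BEGIN {
    s = "order 12345 shipped"
    # On success, RSTART is the 1-based start and RLENGTH the length.
    if (match(s, /[0-9]+/))
        print RSTART, RLENGTH, substr(s, RSTART, RLENGTH)
    # On failure, RSTART is 0 and RLENGTH is -1.
    if (!match(s, /xyz/))
        print RSTART, RLENGTH
}')
printf '%s\n' "$result"
```

The first line of output gives the start, the length, and the matched text; the second shows the values left behind by a failed match.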
|
|
@c ENDOFRANGE bvconi
|
|
@c ENDOFRANGE vbconi
|
|
|
|
@c fakenode --- for prepinfo
|
|
@subheading Advanced Notes: Changing @code{NR} and @code{FNR}
|
|
@cindex @code{NR} variable, changing
|
|
@cindex @code{FNR} variable, changing
|
|
@cindex advanced features, @code{FNR}/@code{NR} variables
|
|
@cindex dark corner, @code{FNR}/@code{NR} variables
|
|
@command{awk} increments @code{NR} and @code{FNR}
|
|
each time it reads a record, instead of setting them to the absolute
|
|
value of the number of records read. This means that a program can
|
|
change these variables and their new values are incremented for
|
|
each record.
|
|
@value{DARKCORNER}
|
|
This is demonstrated in the following example:
|
|
|
|
@example
|
|
$ echo '1
|
|
> 2
|
|
> 3
|
|
> 4' | awk 'NR == 2 @{ NR = 17 @}
|
|
> @{ print NR @}'
|
|
@print{} 1
|
|
@print{} 17
|
|
@print{} 18
|
|
@print{} 19
|
|
@end example
|
|
|
|
@noindent
|
|
Before @code{FNR} was added to the @command{awk} language
|
|
(@pxref{V7/SVR3.1}),
|
|
many @command{awk} programs used this feature to track the number of
|
|
records in a file by resetting @code{NR} to zero when @code{FILENAME}
|
|
changed.
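The old idiom can be sketched roughly as follows. This is only an illustration: the @value{FN}s @file{one} and @file{two}, the variable @code{last}, and the temporary directory are made up for the example. The sketch sets @code{NR} to one rather than zero, because the first record of the new file has already been read by the time the rule runs:

```shell
# Emulate FNR by restarting NR whenever FILENAME changes.
tmpdir=$(mktemp -d)
printf 'a\nb\nc\n' > "$tmpdir/one"
printf 'x\ny\n'    > "$tmpdir/two"

result=$(cd "$tmpdir" && awk '
# A new file has started: remember its name and restart the count.
FILENAME != last { last = FILENAME; NR = 1 }
{ print FILENAME ": " NR }
' one two)
printf '%s\n' "$result"
rm -rf "$tmpdir"
```

With @code{FNR} available, the first rule is unnecessary; printing @code{FNR} gives the same per-file numbering.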
|
|
|
|
@node ARGC and ARGV
|
|
@subsection Using @code{ARGC} and @code{ARGV}
|
|
@cindex @code{ARGC}/@code{ARGV} variables
|
|
@cindex arguments, command-line
|
|
@cindex command line, arguments
|
|
|
|
@ref{Auto-set},
|
|
presented the following program describing the information contained in @code{ARGC}
|
|
and @code{ARGV}:
|
|
|
|
@example
|
|
$ awk 'BEGIN @{
|
|
> for (i = 0; i < ARGC; i++)
|
|
> print ARGV[i]
|
|
> @}' inventory-shipped BBS-list
|
|
@print{} awk
|
|
@print{} inventory-shipped
|
|
@print{} BBS-list
|
|
@end example
|
|
|
|
@noindent
|
|
In this example, @code{ARGV[0]} contains @samp{awk}, @code{ARGV[1]}
|
|
contains @samp{inventory-shipped}, and @code{ARGV[2]} contains
|
|
@samp{BBS-list}.
|
|
Notice that the @command{awk} program is not entered in @code{ARGV}. The
|
|
other special command-line options, with their arguments, are also not
|
|
entered. This includes variable assignments done with the @option{-v}
|
|
option (@pxref{Options}).
|
|
Normal variable assignments on the command line @emph{are}
|
|
treated as arguments and do show up in the @code{ARGV} array:
|
|
|
|
@example
|
|
$ cat showargs.awk
|
|
@print{} BEGIN @{
|
|
@print{} printf "A=%d, B=%d\n", A, B
|
|
@print{} for (i = 0; i < ARGC; i++)
|
|
@print{} printf "\tARGV[%d] = %s\n", i, ARGV[i]
|
|
@print{} @}
|
|
@print{} END @{ printf "A=%d, B=%d\n", A, B @}
|
|
$ awk -v A=1 -f showargs.awk B=2 /dev/null
|
|
@print{} A=1, B=0
|
|
@print{} ARGV[0] = awk
|
|
@print{} ARGV[1] = B=2
|
|
@print{} ARGV[2] = /dev/null
|
|
@print{} A=1, B=2
|
|
@end example
|
|
|
|
A program can alter @code{ARGC} and the elements of @code{ARGV}.
|
|
Each time @command{awk} reaches the end of an input file, it uses the next
|
|
element of @code{ARGV} as the name of the next input file. By storing a
|
|
different string there, a program can change which files are read.
|
|
Use @code{"-"} to represent the standard input. Storing
|
|
additional elements and incrementing @code{ARGC} causes
|
|
additional files to be read.
|
|
|
|
If the value of @code{ARGC} is decreased, that eliminates input files
|
|
from the end of the list. By recording the old value of @code{ARGC}
|
|
elsewhere, a program can treat the eliminated arguments as
|
|
something other than @value{FN}s.
|
|
|
|
To eliminate a file from the middle of the list, store the null string
|
|
(@code{""}) into @code{ARGV} in place of the file's name. As a
|
|
special feature, @command{awk} ignores @value{FN}s that have been
|
|
replaced with the null string.
|
|
Another option is to
|
|
use the @code{delete} statement to remove elements from
|
|
@code{ARGV} (@pxref{Delete}).
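The following sketch shows two of these techniques together, under the assumption of three small data files named @file{f1}, @file{f2}, and @file{f3} (names invented for the example). It blanks out the middle argument and appends an extra one:

```shell
# Skip one input file and add another by editing ARGV and ARGC.
tmpdir=$(mktemp -d)
printf 'first\n'  > "$tmpdir/f1"
printf 'second\n' > "$tmpdir/f2"
printf 'third\n'  > "$tmpdir/f3"

result=$(cd "$tmpdir" && awk '
BEGIN {
    ARGV[2] = ""        # f2 is skipped: null-string arguments are ignored
    ARGV[ARGC] = "f1"   # store an additional element ...
    ARGC++              # ... and make awk aware of it
}
{ print FILENAME ": " $0 }
' f1 f2 f3)
printf '%s\n' "$result"
rm -rf "$tmpdir"
```

The output shows @file{f1} being read, @file{f2} being skipped, @file{f3} being read, and then @file{f1} being read a second time.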
|
|
|
|
All of these actions are typically done in the @code{BEGIN} rule,
|
|
before actual processing of the input begins.
|
|
@xref{Split Program}, and see
|
|
@ref{Tee Program}, for examples
|
|
of each way of removing elements from @code{ARGV}.
|
|
The following fragment processes @code{ARGV} in order to examine, and
|
|
then remove, command-line options:
|
|
@c NEXT ED: Add xref to rewind() function
|
|
|
|
@example
|
|
BEGIN @{
|
|
for (i = 1; i < ARGC; i++) @{
|
|
if (ARGV[i] == "-v")
|
|
verbose = 1
|
|
else if (ARGV[i] == "-d")
|
|
debug = 1
|
|
else if (ARGV[i] ~ /^-./) @{
|
|
e = sprintf("%s: unrecognized option -- %c",
|
|
ARGV[0], substr(ARGV[i], 2, 1))
|
|
print e > "/dev/stderr"
|
|
@} else
|
|
break
|
|
delete ARGV[i]
|
|
@}
|
|
@}
|
|
@end example
|
|
|
|
To actually get the options into the @command{awk} program,
|
|
end the @command{awk} options with @option{--} and then supply
|
|
the @command{awk} program's options, in the following manner:
|
|
|
|
@example
|
|
awk -f myprog -- -v -d file1 file2 @dots{}
|
|
@end example
|
|
|
|
@cindex differences in @command{awk} and @command{gawk}, @code{ARGC}/@code{ARGV} variables
|
|
This is not necessary in @command{gawk}. Unless @option{--posix} has
|
|
been specified, @command{gawk} silently puts any unrecognized options
|
|
into @code{ARGV} for the @command{awk} program to deal with. As soon
|
|
as it sees an unknown option, @command{gawk} stops looking for other
|
|
options that it might otherwise recognize. The previous example with
|
|
@command{gawk} would be:
|
|
|
|
@example
|
|
gawk -f myprog -d -v file1 file2 @dots{}
|
|
@end example
|
|
|
|
@noindent
|
|
Because @option{-d} is not a valid @command{gawk} option,
|
|
it and the following @option{-v}
|
|
are passed on to the @command{awk} program.
|
|
|
|
@node Arrays
|
|
@chapter Arrays in @command{awk}
|
|
@c STARTOFRANGE arrs
|
|
@cindex arrays
|
|
|
|
An @dfn{array} is a table of values called @dfn{elements}. The
|
|
elements of an array are distinguished by their indices. @dfn{Indices}
|
|
may be either numbers or strings.
|
|
|
|
This @value{CHAPTER} describes how arrays work in @command{awk},
|
|
how to use array elements, how to scan through every element in an array,
|
|
and how to remove array elements.
|
|
It also describes how @command{awk} simulates multidimensional
|
|
arrays, as well as some of the less obvious points about array usage.
|
|
The @value{CHAPTER} finishes with a discussion of @command{gawk}'s facility
|
|
for sorting an array based on its indices.
|
|
|
|
@cindex variables, names of
|
|
@cindex functions, names of
|
|
@cindex arrays, names of
|
|
@cindex names, arrays/variables
|
|
@cindex namespace issues
|
|
@command{awk} maintains a single set
|
|
of names that may be used for naming variables, arrays, and functions
|
|
(@pxref{User-defined}).
|
|
Thus, you cannot have a variable and an array with the same name in the
|
|
same @command{awk} program.
|
|
|
|
@menu
|
|
* Array Intro:: Introduction to Arrays
|
|
* Reference to Elements:: How to examine one element of an array.
|
|
* Assigning Elements:: How to change an element of an array.
|
|
* Array Example:: Basic Example of an Array
|
|
* Scanning an Array:: A variation of the @code{for} statement. It
|
|
loops through the indices of an array's
|
|
existing elements.
|
|
* Delete:: The @code{delete} statement removes an element
|
|
from an array.
|
|
* Numeric Array Subscripts:: How to use numbers as subscripts in
|
|
@command{awk}.
|
|
* Uninitialized Subscripts:: Using uninitialized variables as subscripts.
|
|
* Multi-dimensional:: Emulating multidimensional arrays in
|
|
@command{awk}.
|
|
* Multi-scanning:: Scanning multidimensional arrays.
|
|
* Array Sorting:: Sorting array values and indices.
|
|
@end menu
|
|
|
|
@node Array Intro
|
|
@section Introduction to Arrays
|
|
|
|
The @command{awk} language provides one-dimensional arrays
|
|
for storing groups of related strings or numbers.
|
|
Every @command{awk} array must have a name. Array names have the same
|
|
syntax as variable names; any valid variable name would also be a valid
|
|
array name. But one name cannot be used in both ways (as an array and
|
|
as a variable) in the same @command{awk} program.
|
|
|
|
Arrays in @command{awk} superficially resemble arrays in other programming
|
|
languages, but there are fundamental differences. In @command{awk}, it
|
|
isn't necessary to specify the size of an array before starting to use it.
|
|
Additionally, any number or string in @command{awk}, not just consecutive integers,
|
|
may be used as an array index.
|
|
|
|
In most other languages, arrays must be @dfn{declared} before use,
|
|
including a specification of
|
|
how many elements or components they contain. In such languages, the
|
|
declaration causes a contiguous block of memory to be allocated for that
|
|
many elements. Usually, an index in the array must be a nonnegative integer.
|
|
For example, the index zero specifies the first element in the array, which is
|
|
actually stored at the beginning of the block of memory. Index one
|
|
specifies the second element, which is stored in memory right after the
|
|
first element, and so on. It is impossible to add more elements to the
|
|
array, because it has room only for as many elements as given in
|
|
the declaration.
|
|
(Some languages allow arbitrary starting and ending
|
|
indices---e.g., @samp{15 .. 27}---but the size of the array is still fixed when
|
|
the array is declared.)
|
|
|
|
A contiguous array of four elements might look like the following example,
|
|
conceptually, if the element values are 8, @code{"foo"},
|
|
@code{""}, and 30:
|
|
|
|
@c NEXT ED: Use real images here
|
|
@iftex
|
|
@c from Karl Berry, much thanks for the help.
|
|
@tex
|
|
\bigskip % space above the table (about 1 linespace)
|
|
\offinterlineskip
|
|
\newdimen\width \width = 1.5cm
|
|
\newdimen\hwidth \hwidth = 4\width \advance\hwidth by 2pt % 5 * 0.4pt
|
|
\centerline{\vbox{
|
|
\halign{\strut\hfil\ignorespaces#&&\vrule#&\hbox to\width{\hfil#\unskip\hfil}\cr
|
|
\noalign{\hrule width\hwidth}
|
|
&&{\tt 8} &&{\tt "foo"} &&{\tt ""} &&{\tt 30} &&\quad Value\cr
|
|
\noalign{\hrule width\hwidth}
|
|
\noalign{\smallskip}
|
|
&\omit&0&\omit &1 &\omit&2 &\omit&3 &\omit&\quad Index\cr
|
|
}
|
|
}}
|
|
@end tex
|
|
@end iftex
|
|
@ifinfo
|
|
@example
|
|
+---------+---------+--------+---------+
|
|
| 8 | "foo" | "" | 30 | @r{Value}
|
|
+---------+---------+--------+---------+
|
|
0 1 2 3 @r{Index}
|
|
@end example
|
|
@end ifinfo
|
|
@ifxml
|
|
@example
|
|
+---------+---------+--------+---------+
|
|
| 8 | "foo" | "" | 30 | @r{Value}
|
|
+---------+---------+--------+---------+
|
|
0 1 2 3 @r{Index}
|
|
@end example
|
|
@end ifxml
|
|
|
|
@noindent
|
|
Only the values are stored; the indices are implicit from the order of
|
|
the values. Here, 8 is the value at index zero, because 8 appears in the
|
|
position with zero elements before it.
|
|
|
|
@c STARTOFRANGE arrin
|
|
@cindex arrays, indexing
|
|
@c STARTOFRANGE inarr
|
|
@cindex indexing arrays
|
|
@cindex associative arrays
|
|
@cindex arrays, associative
|
|
Arrays in @command{awk} are different---they are @dfn{associative}. This means
|
|
that each array is a collection of pairs: an index and its corresponding
|
|
array element value:
|
|
|
|
@example
|
|
@r{Element} 3 @r{Value} 30
|
|
@r{Element} 1 @r{Value} "foo"
|
|
@r{Element} 0 @r{Value} 8
|
|
@r{Element} 2 @r{Value} ""
|
|
@end example
|
|
|
|
@noindent
|
|
The pairs are shown in jumbled order because their order is irrelevant.
|
|
|
|
One advantage of associative arrays is that new pairs can be added
|
|
at any time. For example, suppose an element with index 10 is added to the array
|
|
whose value is @w{@code{"number ten"}}. The result is:
|
|
|
|
@example
|
|
@r{Element} 10 @r{Value} "number ten"
|
|
@r{Element} 3 @r{Value} 30
|
|
@r{Element} 1 @r{Value} "foo"
|
|
@r{Element} 0 @r{Value} 8
|
|
@r{Element} 2 @r{Value} ""
|
|
@end example
|
|
|
|
@noindent
|
|
@cindex sparse arrays
|
|
@cindex arrays, sparse
|
|
Now the array is @dfn{sparse}, which just means some indices are missing.
|
|
It has elements 0--3 and 10, but doesn't have elements 4, 5, 6, 7, 8, or 9.
|
|
|
|
Another consequence of associative arrays is that the indices don't
|
|
have to be positive integers. Any number, or even a string, can be
|
|
an index. For example, the following is an array that translates words from
|
|
English to French:
|
|
|
|
@example
|
|
@r{Element} "dog" @r{Value} "chien"
|
|
@r{Element} "cat" @r{Value} "chat"
|
|
@r{Element} "one" @r{Value} "un"
|
|
@r{Element} 1 @r{Value} "un"
|
|
@end example
|
|
|
|
@noindent
|
|
Here we decided to translate the number one in both spelled-out and
|
|
numeric form---thus illustrating that a single array can have both
|
|
numbers and strings as indices.
|
|
In fact, array subscripts are always strings; this is discussed
|
|
in more detail in
|
|
@ref{Numeric Array Subscripts}.
|
|
Here, the number @code{1} isn't double-quoted, since @command{awk}
|
|
automatically converts it to a string.
|
|
|
|
@cindex case sensitivity, array indices and
|
|
@cindex arrays, @code{IGNORECASE} variable and
|
|
@cindex @code{IGNORECASE} variable, array subscripts and
|
|
The value of @code{IGNORECASE} has no effect upon array subscripting.
|
|
The identical string value used to store an array element must be used
|
|
to retrieve it.
|
|
When @command{awk} creates an array (e.g., with the @code{split}
|
|
built-in function),
|
|
that array's indices are consecutive integers starting at one.
|
|
(@xref{String Functions}.)
|
|
|
|
@command{awk}'s arrays are efficient---the time to access an element
|
|
is independent of the number of elements in the array.
|
|
@c ENDOFRANGE arrin
|
|
@c ENDOFRANGE inarr
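The English-to-French table described above might be built as in the following sketch. The array names @code{fr} and @code{parts} are chosen only for illustration. The sketch also shows that a numeric index and its string form reach the same element, and that @code{split} numbers its elements from one:

```shell
# Build a small associative array and look elements up by index.
result=$(awk 'BEGIN {
    fr["dog"] = "chien"
    fr["cat"] = "chat"
    fr["one"] = "un"
    fr[1]     = "un"     # numeric index, stored under the string "1"

    print fr["dog"], fr["1"]

    # split() fills parts with indices 1 through n.
    n = split("a b c", parts, " ")
    print n, parts[1], parts[3]
}')
printf '%s\n' "$result"
```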
|
|
|
|
@node Reference to Elements
|
|
@section Referring to an Array Element
|
|
@cindex arrays, elements, referencing
|
|
@cindex elements in arrays
|
|
|
|
The principal way to use an array is to refer to one of its elements.
|
|
An array reference is an expression as follows:
|
|
|
|
@example
|
|
@var{array}[@var{index}]
|
|
@end example
|
|
|
|
@noindent
|
|
Here, @var{array} is the name of an array. The expression @var{index} is
|
|
the index of the desired element of the array.
|
|
|
|
The value of the array reference is the current value of that array
|
|
element. For example, @code{foo[4.3]} is an expression for the element
|
|
of array @code{foo} at index @samp{4.3}.
|
|
|
|
A reference to an array element that has no recorded value yields a value of
|
|
@code{""}, the null string. This includes elements
|
|
that have not been assigned any value as well as elements that have been
|
|
deleted (@pxref{Delete}). Such a reference
|
|
automatically creates that array element, with the null string as its value.
|
|
(In some cases, this is unfortunate, because it might waste memory inside
|
|
@command{awk}.)
|
|
|
|
@c @cindex arrays, @code{in} operator and
|
|
@cindex @code{in} operator, arrays and
|
|
To determine whether an element exists in an array at a certain index, use
|
|
the following expression:
|
|
|
|
@example
|
|
@var{index} in @var{array}
|
|
@end example
|
|
|
|
@cindex side effects, array indexing
|
|
@noindent
|
|
This expression tests whether the particular index exists,
|
|
without the side effect of creating that element if it is not present.
|
|
The expression has the value one (true) if @code{@var{array}[@var{index}]}
|
|
exists and zero (false) if it does not exist.
|
|
For example, this statement tests whether the array @code{frequencies}
|
|
contains the index @samp{2}:
|
|
|
|
@example
|
|
if (2 in frequencies)
|
|
print "Subscript 2 is present."
|
|
@end example
|
|
|
|
Note that this is @emph{not} a test of whether the array
|
|
@code{frequencies} contains an element whose @emph{value} is two.
|
|
There is no way to do that except to scan all the elements. Also, this
|
|
@emph{does not} create @code{frequencies[2]}, while the following
|
|
(incorrect) alternative does:
|
|
|
|
@example
|
|
if (frequencies[2] != "")
|
|
print "Subscript 2 is present."
|
|
@end example
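The side effect can be observed directly. The following sketch (the variable names @code{before} and @code{after} are invented for the example) tests for the index with @code{in}, then references the element, and then tests again:

```shell
# A mere reference to an array element creates that element.
result=$(awk 'BEGIN {
    before = (2 in frequencies)    # 0: nothing has been stored yet
    if (frequencies[2] != "")      # this reference creates frequencies[2]
        print "never printed"
    after = (2 in frequencies)     # 1: the element now exists
    print before, after
}')
printf '%s\n' "$result"
```

The second test succeeds even though the program never assigned anything to @code{frequencies[2]}.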
|
|
|
|
@node Assigning Elements
|
|
@section Assigning Array Elements
|
|
@cindex arrays, elements, assigning
|
|
@cindex elements in arrays, assigning
|
|
|
|
Array elements can be assigned values just like
|
|
@command{awk} variables:
|
|
|
|
@example
|
|
@var{array}[@var{subscript}] = @var{value}
|
|
@end example
|
|
|
|
@noindent
|
|
@var{array} is the name of an array. The expression
|
|
@var{subscript} is the index of the element of the array that is
|
|
assigned a value. The expression @var{value} is the value to
|
|
assign to that element of the array.
|
|
|
|
@node Array Example
|
|
@section Basic Array Example
|
|
|
|
The following program takes a list of lines, each beginning with a line
|
|
number, and prints them out in order of line number. The line numbers
|
|
are not in order when they are first read---instead they
|
|
are scrambled. This program sorts the lines by making an array using
|
|
the line numbers as subscripts. The program then prints out the lines
|
|
in sorted order of their numbers. It is a very simple program and gets
|
|
confused upon encountering repeated numbers, gaps, or lines that don't
|
|
begin with a number:
|
|
|
|
@example
|
|
@c file eg/misc/arraymax.awk
|
|
@{
|
|
if ($1 > max)
|
|
max = $1
|
|
arr[$1] = $0
|
|
@}
|
|
|
|
END @{
|
|
for (x = 1; x <= max; x++)
|
|
print arr[x]
|
|
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
The first rule keeps track of the largest line number seen so far;
|
|
it also stores each line into the array @code{arr}, at an index that
|
|
is the line's number.
|
|
The second rule runs after all the input has been read, to print out
|
|
all the lines.
|
|
When this program is run with the following input:
|
|
|
|
@example
|
|
@c file eg/misc/arraymax.data
|
|
5 I am the Five man
|
|
2 Who are you? The new number two!
|
|
4 . . . And four on the floor
|
|
1 Who is number one?
|
|
3 I three you.
|
|
@c endfile
|
|
@end example
|
|
|
|
@noindent
|
|
Its output is:
|
|
|
|
@example
|
|
1 Who is number one?
|
|
2 Who are you? The new number two!
|
|
3 I three you.
|
|
4 . . . And four on the floor
|
|
5 I am the Five man
|
|
@end example
|
|
|
|
If a line number is repeated, the last line with a given number overrides
|
|
the others.
|
|
Gaps in the line numbers can be handled with an easy improvement to the
|
|
program's @code{END} rule, as follows:
|
|
|
|
@example
|
|
END @{
|
|
for (x = 1; x <= max; x++)
|
|
if (x in arr)
|
|
print arr[x]
|
|
@}
|
|
@end example
|
|
|
|
@node Scanning an Array
|
|
@section Scanning All Elements of an Array
|
|
@cindex elements in arrays, scanning
|
|
@cindex arrays, scanning
|
|
|
|
In programs that use arrays, it is often necessary to use a loop that
|
|
executes once for each element of an array. In other languages, where
|
|
arrays are contiguous and indices are limited to nonnegative integers,
|
|
this is easy: all the valid indices can be found by counting from
|
|
the lowest index up to the highest. This technique won't do the job
|
|
in @command{awk}, because any number or string can be an array index.
|
|
So @command{awk} has a special kind of @code{for} statement for scanning
|
|
an array:
|
|
|
|
@example
|
|
for (@var{var} in @var{array})
|
|
@var{body}
|
|
@end example
|
|
|
|
@noindent
|
|
@cindex @code{in} operator, arrays and
|
|
This loop executes @var{body} once for each index in @var{array} that the
|
|
program has previously used, with the variable @var{var} set to that index.
|
|
|
|
@cindex arrays, @code{for} statement and
|
|
@cindex @code{for} statement, in arrays
|
|
The following program uses this form of the @code{for} statement. The
|
|
first rule scans the input records and notes which words appear (at
|
|
least once) in the input, by storing a one into the array @code{used} with
|
|
the word as index. The second rule scans the elements of @code{used} to
|
|
find all the distinct words that appear in the input. It prints each
|
|
word that is more than 10 characters long and also prints the number of
|
|
such words.
|
|
@xref{String Functions},
|
|
for more information on the built-in function @code{length}.
|
|
|
|
@example
|
|
# Record a 1 for each word that is used at least once
|
|
@{
|
|
for (i = 1; i <= NF; i++)
|
|
used[$i] = 1
|
|
@}
|
|
|
|
# Find number of distinct words more than 10 characters long
|
|
END @{
|
|
for (x in used)
|
|
if (length(x) > 10) @{
|
|
++num_long_words
|
|
print x
|
|
@}
|
|
print num_long_words, "words longer than 10 characters"
|
|
@}
|
|
@end example
|
|
|
|
@noindent
|
|
@xref{Word Sorting},
|
|
for a more detailed example of this type.
|
|
|
|
@cindex arrays, elements, order of
|
|
@cindex elements in arrays, order of
|
|
The order in which elements of the array are accessed by this statement
|
|
is determined by the internal arrangement of the array elements within
|
|
@command{awk} and cannot be controlled or changed. This can lead to
|
|
problems if new elements are added to @var{array} by statements in
|
|
the loop body; it is not predictable whether the @code{for} loop will
|
|
reach them. Similarly, changing @var{var} inside the loop may produce
|
|
strange results. It is best to avoid such things.
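Because the scanning order cannot be controlled, a program that needs its output in a predictable order has to impose that order itself. One possible approach, sketched here on the assumption that the system provides a @command{sort} utility, is to pipe the output of the loop through @command{sort}. The array name @code{count} and the sample input are made up for the example:

```shell
# Count words, then print the counts in sorted order.
result=$(printf 'pear apple\nfig apple pear\n' | awk '
{ for (i = 1; i <= NF; i++) count[$i]++ }
END {
    # The loop visits the indices in no particular order;
    # sort puts the printed lines into a fixed order.
    for (w in count)
        print w, count[w] | "sort"
    close("sort")
}')
printf '%s\n' "$result"
```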
|
|
|
|
@node Delete
|
|
@section The @code{delete} Statement
|
|
@cindex @code{delete} statement
|
|
@cindex deleting elements in arrays
|
|
@cindex arrays, elements, deleting
|
|
@cindex elements in arrays, deleting
|
|
|
|
To remove an individual element of an array, use the @code{delete}
|
|
statement:
|
|
|
|
@example
|
|
delete @var{array}[@var{index}]
|
|
@end example
|
|
|
|
Once an array element has been deleted, any value the element once
|
|
had is no longer available. It is as if the element had never
|
|
been referred to or been given a value.
|
|
The following is an example of deleting elements in an array:
|
|
|
|
@example
|
|
for (i in frequencies)
|
|
delete frequencies[i]
|
|
@end example
|
|
|
|
@noindent
|
|
This example removes all the elements from the array @code{frequencies}.
|
|
Once an element is deleted, a subsequent @code{for} statement to scan the array
|
|
does not report that element and the @code{in} operator to check for
|
|
the presence of that element returns zero (i.e., false):
|
|
|
|
@example
|
|
delete foo[4]
|
|
if (4 in foo)
|
|
print "This will never be printed"
|
|
@end example
|
|
|
|
@cindex null strings, array elements and
|
|
It is important to note that deleting an element is @emph{not} the
|
|
same as assigning it a null value (the empty string, @code{""}).
|
|
For example:
|
|
|
|
@example
|
|
foo[4] = ""
|
|
if (4 in foo)
|
|
print "This is printed, even though foo[4] is empty"
|
|
@end example
|
|
|
|
@cindex lint checking, array elements
|
|
It is not an error to delete an element that does not exist.
|
|
If @option{--lint} is provided on the command line
|
|
(@pxref{Options}),
|
|
@command{gawk} issues a warning message when an element that
|
|
is not in the array is deleted.
|
|
|
|
@cindex arrays, deleting entire contents
|
|
@cindex deleting entire arrays
|
|
@cindex differences in @command{awk} and @command{gawk}, array elements, deleting
|
|
All the elements of an array may be deleted with a single statement
|
|
by leaving off the subscript in the @code{delete} statement,
|
|
as follows:
|
|
|
|
@example
|
|
delete @var{array}
|
|
@end example
|
|
|
|
This ability is a @command{gawk} extension; it is not available in
|
|
compatibility mode (@pxref{Options}).
|
|
|
|
Using this version of the @code{delete} statement is about three times
|
|
more efficient than the equivalent loop that deletes each element one
|
|
at a time.
|
|
|
|
@cindex portability, deleting array elements
|
|
@cindex Brennan, Michael
|
|
The following statement provides a portable but nonobvious way to clear
|
|
out an array:@footnote{Thanks to Michael Brennan for pointing this out.}
|
|
|
|
@example
|
|
split("", array)
|
|
@end example
|
|
|
|
@c comma before deleting does NOT start a tertiary
|
|
@cindex @code{split} function, array elements, deleting
|
|
The @code{split} function
|
|
(@pxref{String Functions})
|
|
clears out the target array first. This call asks it to split
|
|
apart the null string. Because there is no data to split out, the
|
|
function simply clears the array and then returns.
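A small check of this idiom might look like the following sketch, in which the array @code{a} and the variables @code{n} and @code{present} are invented for the example. After the call to @code{split}, the loop finds no elements and the @code{in} test fails:

```shell
# Clear an array portably with split("", array).
result=$(awk 'BEGIN {
    a["x"] = 1; a["y"] = 2
    split("", a)              # removes every element of a
    n = 0
    for (k in a)              # counts the remaining elements
        n++
    present = ("x" in a)
    print n, present
}')
printf '%s\n' "$result"
```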
|
|
|
|
@strong{Caution:} Deleting an array does not change its type; you cannot
|
|
delete an array and then use the array's name as a scalar
|
|
(i.e., a regular variable). For example, the following does not work:
|
|
|
|
@example
|
|
a[1] = 3; delete a; a = 3
|
|
@end example
|
|
|
|
@node Numeric Array Subscripts
|
|
@section Using Numbers to Subscript Arrays
|
|
|
|
@cindex numbers, as array subscripts
|
|
@cindex arrays, subscripts
|
|
@cindex subscripts in arrays, numbers as
|
|
@cindex @code{CONVFMT} variable, array subscripts and
|
|
An important aspect of arrays to remember is that @emph{array subscripts
|
|
are always strings}. When a numeric value is used as a subscript,
|
|
it is converted to a string value before being used for subscripting
|
|
(@pxref{Conversion}).
|
|
This means that the value of the built-in variable @code{CONVFMT} can
|
|
affect how your program accesses elements of an array. For example:
|
|
|
|
@example
|
|
xyz = 12.153
|
|
data[xyz] = 1
|
|
CONVFMT = "%2.2f"
|
|
if (xyz in data)
|
|
printf "%s is in data\n", xyz
|
|
else
|
|
printf "%s is not in data\n", xyz
|
|
@end example
|
|
|
|
@noindent
|
|
This prints @samp{12.15 is not in data}. The first statement gives
|
|
@code{xyz} a numeric value. Assigning to
|
|
@code{data[xyz]} subscripts @code{data} with the string value @code{"12.153"}
|
|
(using the default conversion value of @code{CONVFMT}, @code{"%.6g"}).
|
|
Thus, the array element @code{data["12.153"]} is assigned the value one.
|
|
The program then changes
|
|
the value of @code{CONVFMT}. The test @samp{(xyz in data)} generates a new
|
|
string value from @code{xyz}---this time @code{"12.15"}---because the value of
|
|
@code{CONVFMT} only allows two digits after the decimal point. This test fails,
|
|
since @code{"12.15"} is a different string from @code{"12.153"}.
|
|
|
|
@cindex converting, during subscripting
|
|
According to the rules for conversions
|
|
(@pxref{Conversion}), integer
|
|
values are always converted to strings as integers, no matter what the
|
|
value of @code{CONVFMT} may happen to be. So the usual case of
|
|
the following works:
|
|
|
|
@example
|
|
for (i = 1; i <= maxsub; i++)
|
|
@i{do something with} array[i]
|
|
@end example
|
|
|
|
The ``integer values always convert to strings as integers'' rule
|
|
has an additional consequence for array indexing.
|
|
Octal and hexadecimal constants
|
|
(@pxref{Nondecimal-numbers})
|
|
are converted internally into numbers, and their original form
|
|
is forgotten.
|
|
This means, for example, that
|
|
@code{array[17]},
|
|
@code{array[021]},
|
|
and
|
|
@code{array[0x11]}
|
|
all refer to the same element!
|
|
|
|
As with many things in @command{awk}, the majority of the time
|
|
things work as one would expect them to. But it is useful to have a precise
|
|
knowledge of the actual rules, which can sometimes have a subtle
|
|
effect on your programs.
|
|
|
|
@node Uninitialized Subscripts
|
|
@section Using Uninitialized Variables as Subscripts
|
|
|
|
@c last comma does NOT start a tertiary
|
|
@cindex variables, uninitialized, as array subscripts
|
|
@cindex uninitialized variables, as array subscripts
|
|
@cindex subscripts in arrays, uninitialized variables as
|
|
@cindex arrays, subscripts, uninitialized variables as
|
|
Suppose it's necessary to write a program
|
|
to print the input data in reverse order.
|
|
A reasonable attempt to do so (with some test
|
|
data) might look like this:
|
|
|
|
@example
|
|
$ echo 'line 1
|
|
> line 2
|
|
> line 3' | awk '@{ l[lines] = $0; ++lines @}
|
|
> END @{
|
|
> for (i = lines-1; i >= 0; --i)
|
|
> print l[i]
|
|
> @}'
|
|
@print{} line 3
|
|
@print{} line 2
|
|
@end example
|
|
|
|
Unfortunately, the very first line of input data did not come out in the
|
|
output!
|
|
|
|
At first glance, this program should have worked. The variable @code{lines}
|
|
is uninitialized, and uninitialized variables have the numeric value zero.
|
|
So, @command{awk} should have printed the value of @code{l[0]}.
|
|
|
|
The issue here is that subscripts for @command{awk} arrays are @emph{always}
|
|
strings. Uninitialized variables, when used as strings, have the
|
|
value @code{""}, not zero. Thus, @samp{line 1} ends up stored in
|
|
@code{l[""]}.
|
|
The following version of the program works correctly:
|
|
|
|
@example
|
|
@{ l[lines++] = $0 @}
|
|
END @{
|
|
for (i = lines - 1; i >= 0; --i)
|
|
print l[i]
|
|
@}
|
|
@end example
|
|
|
|
Here, the @samp{++} forces @code{lines} to be numeric, thus making
|
|
the ``old value'' numeric zero. This is then converted to @code{"0"}
|
|
as the array subscript.
|
|
|
|
@cindex null strings, as array subscripts
|
|
@cindex dark corner, array subscripts
|
|
@cindex lint checking, array subscripts
|
|
Even though it is somewhat unusual, the null string
|
|
(@code{""}) is a valid array subscript.
|
|
@value{DARKCORNER}
|
|
@command{gawk} warns about the use of the null string as a subscript
|
|
if @option{--lint} is provided
|
|
on the command line (@pxref{Options}).
|
|
|
|
@node Multi-dimensional
|
|
@section Multidimensional Arrays
|
|
|
|
@cindex subscripts in arrays, multidimensional
|
|
@cindex arrays, multidimensional
|
|
A multidimensional array is an array in which an element is identified
|
|
by a sequence of indices instead of a single index. For example, a
|
|
two-dimensional array requires two indices. The usual way (in most
|
|
languages, including @command{awk}) to refer to an element of a
|
|
two-dimensional array named @code{grid} is with
|
|
@code{grid[@var{x},@var{y}]}.
|
|
|
|
@cindex @code{SUBSEP} variable, multidimensional arrays
|
|
Multidimensional arrays are supported in @command{awk} through
|
|
concatenation of indices into one string.
|
|
@command{awk} converts the indices into strings
|
|
(@pxref{Conversion}) and
|
|
concatenates them together, with a separator between them. This creates
|
|
a single string that describes the values of the separate indices. The
|
|
combined string is used as a single index into an ordinary,
|
|
one-dimensional array. The separator used is the value of the built-in
|
|
variable @code{SUBSEP}.
|
|
|
|
For example, suppose we evaluate the expression @samp{foo[5,12] = "value"}
|
|
when the value of @code{SUBSEP} is @code{"@@"}. The numbers 5 and 12 are
|
|
converted to strings and
|
|
concatenated with an @samp{@@} between them, yielding @code{"5@@12"}; thus,
|
|
the array element @code{foo["5@@12"]} is set to @code{"value"}.
|
|
|
|
Once the element's value is stored, @command{awk} has no record of whether
|
|
it was stored with a single index or a sequence of indices. The two
|
|
expressions @samp{foo[5,12]} and @w{@samp{foo[5 SUBSEP 12]}} are always
|
|
equivalent.
|
|
|
|
The default value of @code{SUBSEP} is the string @code{"\034"},
|
|
which contains a nonprinting character that is unlikely to appear in an
|
|
@command{awk} program or in most input data.
|
|
The usefulness of choosing an unlikely character comes from the fact
|
|
that index values that contain a string matching @code{SUBSEP} can lead to
|
|
combined strings that are ambiguous. Suppose that @code{SUBSEP} is
|
|
@code{"@@"}; then @w{@samp{foo["a@@b", "c"]}} and @w{@samp{foo["a",
|
|
"b@@c"]}} are indistinguishable because both are actually
|
|
stored as @samp{foo["a@@b@@c"]}.
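Both points can be checked with a short sketch. The array @code{bar} and the variables @code{same} and @code{n} are invented for the example. The first test uses the default @code{SUBSEP}; the rest of the program uses the at-sign separator discussed above:

```shell
# Multidimensional subscripts are just strings joined with SUBSEP.
result=$(awk 'BEGIN {
    foo[5, 12] = "value"
    same = ((5 SUBSEP 12) in foo)   # 1: the same combined index

    SUBSEP = "@"
    bar["a@b", "c"] = "first"
    bar["a", "b@c"] = "second"      # overwrites: both indices are "a@b@c"
    n = 0
    for (k in bar)
        n++
    print same, n, bar["a@b@c"]
}')
printf '%s\n' "$result"
```

Only one element ends up in @code{bar}, and it holds the value stored last.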
|
|
|
|
To test whether a particular index sequence exists in a
|
|
multidimensional array, use the same operator (@samp{in}) that is
|
|
used for single-dimensional arrays. Write the whole sequence of indices
|
|
in parentheses, separated by commas, as the left operand:
|
|
|
|
@example
|
|
(@var{subscript1}, @var{subscript2}, @dots{}) in @var{array}
|
|
@end example
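For instance, the following sketch (with an invented array named @code{grid}) stores one element and then tests two index sequences. The final count shows that the failed test did not create a new element:

```shell
# Test for index sequences in a "multidimensional" array.
result=$(awk 'BEGIN {
    grid[1, 2] = "x"

    if ((1, 2) in grid)
        a = "yes"
    else
        a = "no"

    if ((2, 1) in grid)
        b = "yes"
    else
        b = "no"

    n = 0
    for (k in grid)      # still just one element
        n++
    print a, b, n
}')
printf '%s\n' "$result"
```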
|
|
|
|
The following example treats its input as a two-dimensional array of
|
|
fields; it rotates this array 90 degrees clockwise and prints the
|
|
result. It assumes that all lines have the same number of
|
|
elements:
|
|
|
|
@example
|
|
@{
|
|
if (max_nf < NF)
|
|
max_nf = NF
|
|
max_nr = NR
|
|
for (x = 1; x <= NF; x++)
|
|
vector[x, NR] = $x
|
|
@}
|
|
|
|
END @{
|
|
for (x = 1; x <= max_nf; x++) @{
|
|
for (y = max_nr; y >= 1; --y)
|
|
printf("%s ", vector[x, y])
|
|
printf("\n")
|
|
@}
|
|
@}
|
|
@end example
|
|
|
|
@noindent
|
|
When given the input:
|
|
|
|
@example
|
|
1 2 3 4 5 6
|
|
2 3 4 5 6 1
|
|
3 4 5 6 1 2
|
|
4 5 6 1 2 3
|
|
@end example
|
|
|
|
@noindent
|
|
the program produces the following output:
|
|
|
|
@example
|
|
4 3 2 1
|
|
5 4 3 2
|
|
6 5 4 3
|
|
1 6 5 4
|
|
2 1 6 5
|
|
3 2 1 6
|
|
@end example
|
|
|
|
@node Multi-scanning
|
|
@section Scanning Multidimensional Arrays
|
|
|
|
There is no special @code{for} statement for scanning a
|
|
``multidimensional'' array. There cannot be one, because, in truth, there
|
|
are no multidimensional arrays or elements---there is only a
|
|
multidimensional @emph{way of accessing} an array.
|
|
|
|
@cindex subscripts in arrays, multidimensional, scanning
|
|
@cindex arrays, multidimensional, scanning
|
|
However, if your program has an array that is always accessed as
|
|
multidimensional, you can get the effect of scanning it by combining
|
|
the scanning @code{for} statement
|
|
(@pxref{Scanning an Array}) with the
|
|
built-in @code{split} function
|
|
(@pxref{String Functions}).
|
|
It works in the following manner:
|
|
|
|
@example
|
|
for (combined in array) @{
|
|
split(combined, separate, SUBSEP)
|
|
@dots{}
|
|
@}
|
|
@end example
|
|
|
|
@noindent
|
|
This sets the variable @code{combined} to
|
|
each concatenated combined index in the array, and splits it
|
|
into the individual indices by breaking it apart where the value of
|
|
@code{SUBSEP} appears. The individual indices then become the elements of
|
|
the array @code{separate}.
|
|
|
|
Thus, if a value is previously stored in @code{array[1, "foo"]}, then
|
|
an element with index @code{"1\034foo"} exists in @code{array}. (Recall
|
|
that the default value of @code{SUBSEP} is the character with code 034.)
|
|
Sooner or later, the @code{for} statement finds that index and does an
|
|
iteration with the variable @code{combined} set to @code{"1\034foo"}.
|
|
Then the @code{split} function is called as follows:
|
|
|
|
@example
|
|
split("1\034foo", separate, "\034")
|
|
@end example
|
|
|
|
@noindent
|
|
The result is to set @code{separate[1]} to @code{"1"} and
|
|
@code{separate[2]} to @code{"foo"}. Presto! The original sequence of
|
|
separate indices is recovered.
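
Putting the pieces together, here is a short sketch of a complete scan.
The array @code{temp} and its data are invented for the example:

@example
BEGIN @{
    temp["mon", "high"] = 21
    temp["mon", "low"] = 12
    temp["tue", "high"] = 19

    for (combined in temp) @{
        # Recover the two original subscripts.
        split(combined, separate, SUBSEP)
        printf("%s %s -> %d\n",
               separate[1], separate[2], temp[combined])
    @}
@}
@end example

@noindent
Each element is visited exactly once, but the order in which the three
lines are printed is arbitrary, as with any @samp{for (i in array)} loop.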
|
|
|
|
@node Array Sorting
|
|
@section Sorting Array Values and Indices with @command{gawk}
|
|
|
|
@cindex arrays, sorting
|
|
@cindex @code{asort} function (@command{gawk})
|
|
@c last comma does NOT start a tertiary
|
|
@cindex @code{asort} function (@command{gawk}), arrays, sorting
|
|
@cindex sort function, arrays, sorting
|
|
The order in which an array is scanned with a @samp{for (i in array)}
|
|
loop is essentially arbitrary.
|
|
In most @command{awk} implementations, sorting an array requires
|
|
writing a @code{sort} function.
|
|
While this can be educational for exploring different sorting algorithms,
|
|
usually that's not the point of the program.
|
|
@command{gawk} provides the built-in @code{asort}
|
|
and @code{asorti} functions
|
|
(@pxref{String Functions})
|
|
for sorting arrays. For example:
|
|
|
|
@example
|
|
@var{populate the array} data
|
|
n = asort(data)
|
|
for (i = 1; i <= n; i++)
|
|
@var{do something with} data[i]
|
|
@end example
|
|
|
|
After the call to @code{asort}, the array @code{data} is indexed from 1
|
|
to some number @var{n}, the total number of elements in @code{data}.
|
|
(This count is @code{asort}'s return value.)
|
|
The elements are in increasing order: @code{data[1]} @value{LEQ} @code{data[2]} @value{LEQ} @code{data[3]}, and so on.
|
|
The comparison of array elements is done
|
|
using @command{gawk}'s usual comparison rules
|
|
(@pxref{Typing and Comparison}).
|
|
|
|
@cindex side effects, @code{asort} function
|
|
An important side effect of calling @code{asort} is that
|
|
@emph{the array's original indices are irrevocably lost}.
|
|
As this isn't always desirable, @code{asort} accepts a
|
|
second argument:
|
|
|
|
@example
|
|
@var{populate the array} source
|
|
n = asort(source, dest)
|
|
for (i = 1; i <= n; i++)
|
|
@var{do something with} dest[i]
|
|
@end example
|
|
|
|
In this case, @command{gawk} copies the @code{source} array into the
|
|
@code{dest} array and then sorts @code{dest}, destroying its indices.
|
|
However, the @code{source} array is not affected.
|
|
|
|
Often, what's needed is to sort on the values of the @emph{indices}
|
|
instead of the values of the elements.
|
|
To do that, starting with @command{gawk} 3.1.2, use the
|
|
@code{asorti} function. The interface is identical to that of
|
|
@code{asort}, except that the index values are used for sorting, and
|
|
become the values of the result array:
|
|
|
|
@example
|
|
@{ source[$0] = some_func($0) @}
|
|
|
|
END @{
|
|
n = asorti(source, dest)
|
|
for (i = 1; i <= n; i++)
|
|
@var{do something with} dest[i]
|
|
@}
|
|
@end example
|
|
|
|
If your version of @command{gawk} is 3.1.0 or 3.1.1, you don't
|
|
have @code{asorti}. Instead, use a helper array
|
|
to hold the sorted index values, and then access the original array's
|
|
elements. It works in the following way:
|
|
|
|
@example
|
|
@var{populate the array} data
|
|
# copy indices
|
|
j = 1
|
|
for (i in data) @{
|
|
ind[j] = i # index value becomes element value
|
|
j++
|
|
@}
|
|
n = asort(ind) # index values are now sorted
|
|
for (i = 1; i <= n; i++)
|
|
@var{do something with} data[ind[i]]
|
|
@end example
|
|
|
|
Sorting the array by replacing the indices provides maximal flexibility.
|
|
To traverse the elements in decreasing order, use a loop that goes from
|
|
@var{n} down to 1, either over the elements or over the indices.
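
For instance, a decreasing traversal over the sorted elements might look
like the following sketch (the arrays @code{data} and @code{sorted} and
their contents are hypothetical):

@example
BEGIN @{
    data["x"] = 30; data["y"] = 10; data["z"] = 20
    n = asort(data, sorted)       # data itself is left untouched
    for (i = n; i >= 1; i--)
        print sorted[i]           # largest value first
@}
@end example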
|
|
|
|
@cindex reference counting, sorting arrays
|
|
Copying array indices and elements isn't expensive in terms of memory.
|
|
Internally, @command{gawk} maintains @dfn{reference counts} to data.
|
|
For example, when @code{asort} copies the first array to the second one,
|
|
there is only one copy of the original array elements' data, even though
|
|
both arrays use the values. Similarly, when copying the indices from
|
|
@code{data} to @code{ind}, there is only one copy of the actual index
|
|
strings.
|
|
|
|
@c Document It And Call It A Feature. Sigh.
|
|
@cindex arrays, sorting, @code{IGNORECASE} variable and
|
|
@cindex @code{IGNORECASE} variable, array sorting and
|
|
We said previously that comparisons are done using @command{gawk}'s
|
|
``usual comparison rules.'' Because @code{IGNORECASE} affects
|
|
string comparisons, the value of @code{IGNORECASE} also
|
|
affects sorting for both @code{asort} and @code{asorti}.
|
|
Caveat Emptor.
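
The following sketch shows the effect. The word list is made up, and the
case-sensitive ordering described afterward assumes an ASCII-based
character set:

@example
BEGIN @{
    words[1] = "banana"; words[2] = "Cherry"; words[3] = "apple"

    IGNORECASE = 0
    n = asort(words, plain)       # case-sensitive sort
    IGNORECASE = 1
    n = asort(words, folded)      # case-insensitive sort

    for (i = 1; i <= n; i++)
        print plain[i], folded[i]
@}
@end example

@noindent
With @code{IGNORECASE} set to zero, @samp{Cherry} would typically sort
first, because uppercase letters precede lowercase ones in ASCII. With
@code{IGNORECASE} set to one, it sorts last, after @samp{apple} and
@samp{banana}.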
|
|
@c ENDOFRANGE arrs
|
|
|
|
@node Functions
|
|
@chapter Functions
|
|
|
|
@c STARTOFRANGE funcbi
|
|
@cindex functions, built-in
|
|
@c STARTOFRANGE bifunc
|
|
@cindex built-in functions
|
|
This @value{CHAPTER} describes @command{awk}'s built-in functions,
|
|
which fall into three categories: numeric, string, and I/O.
|
|
@command{gawk} provides additional groups of functions
|
|
to work with values that represent time, do
|
|
bit manipulation, and internationalize and localize programs.
|
|
|
|
Besides the built-in functions, @command{awk} has provisions for
|
|
writing new functions that the rest of a program can use.
|
|
The second half of this @value{CHAPTER} describes these
|
|
@dfn{user-defined} functions.
|
|
|
|
@menu
|
|
* Built-in:: Summarizes the built-in functions.
|
|
* User-defined:: Describes User-defined functions in detail.
|
|
@end menu
|
|
|
|
@node Built-in
|
|
@section Built-in Functions
|
|
|
|
@c 2e: USE TEXINFO-2 FUNCTION DEFINITION STUFF!!!!!!!!!!!!!
|
|
@dfn{Built-in} functions are always available for
|
|
your @command{awk} program to call. This @value{SECTION} defines all
|
|
the built-in
|
|
functions in @command{awk}; some of these are mentioned in other sections
|
|
but are summarized here for your convenience.
|
|
|
|
@menu
|
|
* Calling Built-in:: How to call built-in functions.
|
|
* Numeric Functions:: Functions that work with numbers, including
|
|
@code{int}, @code{sin} and @code{rand}.
|
|
* String Functions:: Functions for string manipulation, such as
|
|
@code{split}, @code{match} and @code{sprintf}.
|
|
* I/O Functions:: Functions for files and shell commands.
|
|
* Time Functions:: Functions for dealing with timestamps.
|
|
* Bitwise Functions:: Functions for bitwise operations.
|
|
* I18N Functions:: Functions for string translation.
|
|
@end menu
|
|
|
|
@node Calling Built-in
|
|
@subsection Calling Built-in Functions
|
|
|
|
To call one of @command{awk}'s built-in functions, write the name of
|
|
the function followed
|
|
by arguments in parentheses. For example, @samp{atan2(y + z, 1)}
|
|
is a call to the function @code{atan2} and has two arguments.
|
|
|
|
@cindex programming conventions, functions, calling
|
|
@c last comma does NOT start a tertiary
|
|
@cindex whitespace, functions, calling
|
|
Whitespace is ignored between the built-in function name and the
|
|
open parenthesis, and it is good practice to avoid using whitespace
|
|
there. User-defined functions do not permit whitespace in this way, and
|
|
it is easier to avoid mistakes by following a simple
|
|
convention that always works---no whitespace after a function name.
|
|
|
|
@c last comma is part of tertiary
|
|
@cindex troubleshooting, @command{gawk}, fatal errors, function arguments
|
|
@cindex @command{gawk}, function arguments and
|
|
@cindex differences in @command{awk} and @command{gawk}, function arguments (@command{gawk})
|
|
Each built-in function accepts a certain number of arguments.
|
|
In some cases, arguments can be omitted. The defaults for omitted
|
|
arguments vary from function to function and are described under the
|
|
individual functions. In some @command{awk} implementations, extra
|
|
arguments given to built-in functions are ignored. However, in @command{gawk},
|
|
it is a fatal error to give extra arguments to a built-in function.
|
|
|
|
When a function is called, expressions that create the function's actual
|
|
parameters are evaluated completely before the call is performed.
|
|
For example, in the following code fragment:
|
|
|
|
@example
|
|
i = 4
|
|
j = sqrt(i++)
|
|
@end example
|
|
|
|
@cindex evaluation order, functions
|
|
@cindex functions, built-in, evaluation order
|
|
@cindex built-in functions, evaluation order
|
|
@noindent
|
|
the variable @code{i} is incremented to the value five before @code{sqrt}
|
|
is called with a value of four for its actual parameter.
|
|
The order of evaluation of the expressions used for the function's
|
|
parameters is undefined. Thus, avoid writing programs that
|
|
assume that parameters are evaluated from left to right or from
|
|
right to left. For example:
|
|
|
|
@example
|
|
i = 5
|
|
j = atan2(i++, i *= 2)
|
|
@end example
|
|
|
|
If the order of evaluation is left to right, then @code{i} first becomes
6, and then 12, and @code{atan2} is called with the two arguments 5
and 12 (the value of @samp{i++} is the value of @code{i} before the
increment). But if the order of evaluation is right to left, @code{i}
first becomes 10, then 11, and @code{atan2} is called with the
two arguments 10 and 10.
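
One way to avoid depending on the evaluation order is to compute each
argument in a separate statement first. The following sketch does this
for the left-to-right reading; the variable names @code{y} and @code{x}
are introduced here only for illustration:

@example
BEGIN @{
    i = 5
    y = i++          # y is 5; i is now 6
    x = (i *= 2)     # i and x are both 12
    j = atan2(y, x)  # the arguments no longer have side effects
    print i, j
@}
@end example

@noindent
Because each side effect now happens in its own statement, the result is
the same in every @command{awk} implementation.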
|
|
|
|
@node Numeric Functions
|
|
@subsection Numeric Functions
|
|
|
|
The following list describes all of
|
|
the built-in functions that work with numbers.
|
|
Optional parameters are enclosed in square brackets@w{ ([ ]):}
|
|
|
|
@table @code
|
|
@item int(@var{x})
|
|
@cindex @code{int} function
|
|
This returns the nearest integer to @var{x}, located between @var{x} and zero and
|
|
truncated toward zero.
|
|
|
|
For example, @code{int(3)} is 3, @code{int(3.9)} is 3, @code{int(-3.9)}
|
|
is @minus{}3, and @code{int(-3)} is @minus{}3 as well.
|
|
|
|
@item sqrt(@var{x})
|
|
@cindex @code{sqrt} function
|
|
This returns the positive square root of @var{x}.
|
|
@command{gawk} reports an error
|
|
if @var{x} is negative. For example, @code{sqrt(4)} is 2.
|
|
|
|
@item exp(@var{x})
|
|
@cindex @code{exp} function
|
|
This returns the exponential of @var{x} (@code{e ^ @var{x}}) or reports
|
|
an error if @var{x} is out of range. The range of values @var{x} can have
|
|
depends on your machine's floating-point representation.
|
|
|
|
@item log(@var{x})
|
|
@cindex @code{log} function
|
|
This returns the natural logarithm of @var{x}, if @var{x} is positive;
|
|
otherwise, it reports an error.
|
|
|
|
@item sin(@var{x})
|
|
@cindex @code{sin} function
|
|
This returns the sine of @var{x}, with @var{x} in radians.
|
|
|
|
@item cos(@var{x})
|
|
@cindex @code{cos} function
|
|
This returns the cosine of @var{x}, with @var{x} in radians.
|
|
|
|
@item atan2(@var{y}, @var{x})
|
|
@cindex @code{atan2} function
|
|
This returns the arctangent of @code{@var{y} / @var{x}} in radians.
|
|
|
|
@item rand()
|
|
@cindex @code{rand} function
|
|
@cindex random numbers, @code{rand}/@code{srand} functions
|
|
This returns a random number. The values of @code{rand} are
|
|
uniformly distributed between zero and one.
|
|
The value could be zero but is never one.@footnote{The C version of @code{rand}
|
|
is known to produce fairly poor sequences of random numbers.
|
|
However, nothing requires that an @command{awk} implementation use the C
|
|
@code{rand} to implement the @command{awk} version of @code{rand}.
|
|
In fact, @command{gawk} uses the BSD @code{random} function, which is
|
|
considerably better than @code{rand}, to produce random numbers.}
|
|
|
|
Often random integers are needed instead. Following is a user-defined function
|
|
that can be used to obtain a random non-negative integer less than @var{n}:
|
|
|
|
@example
|
|
function randint(n) @{
|
|
return int(n * rand())
|
|
@}
|
|
@end example
|
|
|
|
@noindent
|
|
The multiplication produces a random number greater than or equal to zero
and less than @code{n}. Using @code{int}, this result is made into
an integer between zero and @code{n} @minus{} 1, inclusive.
|
|
|
|
The following example uses a similar function to produce random integers
|
|
between one and @var{n}. This program prints a new random number for
|
|
each input record:
|
|
|
|
@example
|
|
# Function to roll a simulated die.
|
|
function roll(n) @{ return 1 + int(rand() * n) @}
|
|
|
|
# Roll 3 six-sided dice and
|
|
# print total number of points.
|
|
@{
|
|
printf("%d points\n",
|
|
roll(6)+roll(6)+roll(6))
|
|
@}
|
|
@end example
|
|
|
|
@cindex numbers, random
|
|
@cindex random numbers, seed of
|
|
@c MAWK uses a different seed each time.
|
|
@strong{Caution:} In most @command{awk} implementations, including @command{gawk},
|
|
@code{rand} starts generating numbers from the same
|
|
starting number, or @dfn{seed}, each time you run @command{awk}. Thus,
|
|
a program generates the same results each time you run it.
|
|
The numbers are random within one @command{awk} run but predictable
|
|
from run to run. This is convenient for debugging, but if you want
|
|
a program to do different things each time it is used, you must change
|
|
the seed to a value that is different in each run. To do this,
|
|
use @code{srand}.
|
|
|
|
@item srand(@r{[}@var{x}@r{]})
|
|
@cindex @code{srand} function
|
|
The function @code{srand} sets the starting point, or seed,
|
|
for generating random numbers to the value @var{x}.
|
|
|
|
Each seed value leads to a particular sequence of random
|
|
numbers.@footnote{Computer-generated random numbers really are not truly
|
|
random. They are technically known as ``pseudorandom.'' This means
|
|
that while the numbers in a sequence appear to be random, you can in
|
|
fact generate the same sequence of random numbers over and over again.}
|
|
Thus, if the seed is set to the same value a second time,
|
|
the same sequence of random numbers is produced again.
|
|
|
|
Different @command{awk} implementations use different random-number
|
|
generators internally. Don't expect the same @command{awk} program
|
|
to produce the same series of random numbers when executed by
|
|
different versions of @command{awk}.
|
|
|
|
If the argument @var{x} is omitted, as in @samp{srand()}, then the current
|
|
date and time of day are used for a seed. This is the usual way to get
random numbers that differ from one run to the next.
|
|
|
|
The return value of @code{srand} is the previous seed. This makes it
|
|
easy to keep track of the seeds in case you need to consistently reproduce
|
|
sequences of random numbers.
|
|
@end table
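
As a brief sketch of how the return value of @code{srand} and a fixed
seed can be used to reproduce a sequence, consider the following. The
seed value 42 is an arbitrary choice for the example:

@example
BEGIN @{
    srand(42)                  # choose a fixed, arbitrary seed
    a = rand(); b = rand()

    prev = srand(42)           # reseed; returns the previous seed
    c = rand(); d = rand()

    print prev                 # 42
    print (a == c && b == d)   # 1: the sequence repeats
@}
@end example

@noindent
The actual values of @code{a} and @code{b} are deliberately not shown,
since they differ among @command{awk} implementations.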
|
|
|
|
@node String Functions
|
|
@subsection String-Manipulation Functions
|
|
|
|
The functions in this @value{SECTION} look at or change the text of one or more
|
|
strings.
|
|
Optional parameters are enclosed in square brackets@w{ ([ ]).}
|
|
Those functions that are
|
|
specific to @command{gawk} are marked with a pound sign@w{ (@samp{#}):}
|
|
|
|
@menu
|
|
* Gory Details:: More than you want to know about @samp{\} and
|
|
@samp{&} with @code{sub}, @code{gsub}, and
|
|
@code{gensub}.
|
|
@end menu
|
|
|
|
@table @code
|
|
@item asort(@var{source} @r{[}, @var{dest}@r{]}) #
|
|
@cindex arrays, elements, retrieving number of
|
|
@cindex @code{asort} function (@command{gawk})
|
|
@code{asort} is a @command{gawk}-specific extension, returning the number of
|
|
elements in the array @var{source}. The contents of @var{source} are
|
|
sorted using @command{gawk}'s normal rules for comparing values
|
|
(in particular, @code{IGNORECASE} affects the sorting)
|
|
and the indices
|
|
of the sorted values of @var{source} are replaced with sequential
|
|
integers starting with one. If the optional array @var{dest} is specified,
|
|
then @var{source} is duplicated into @var{dest}. @var{dest} is then
|
|
sorted, leaving the indices of @var{source} unchanged.
|
|
For example, if the contents of @code{a} are as follows:
|
|
|
|
@example
|
|
a["last"] = "de"
|
|
a["first"] = "sac"
|
|
a["middle"] = "cul"
|
|
@end example
|
|
|
|
@noindent
|
|
A call to @code{asort}:
|
|
|
|
@example
|
|
asort(a)
|
|
@end example
|
|
|
|
@noindent
|
|
results in the following contents of @code{a}:
|
|
|
|
@example
|
|
a[1] = "cul"
|
|
a[2] = "de"
|
|
a[3] = "sac"
|
|
@end example
|
|
|
|
The @code{asort} function is described in more detail in
|
|
@ref{Array Sorting}.
|
|
@code{asort} is a @command{gawk} extension; it is not available
|
|
in compatibility mode (@pxref{Options}).
|
|
|
|
@item asorti(@var{source} @r{[}, @var{dest}@r{]}) #
|
|
@cindex @code{asorti} function (@command{gawk})
|
|
@code{asorti} is a @command{gawk}-specific extension, returning the number of
|
|
elements in the array @var{source}.
|
|
It works similarly to @code{asort}; however, the @emph{indices}
|
|
are sorted, instead of the values. As array indices are always strings,
|
|
the comparison performed is always a string comparison. (Here too,
|
|
@code{IGNORECASE} affects the sorting.)
|
|
|
|
The @code{asorti} function is described in more detail in
|
|
@ref{Array Sorting}.
|
|
It was added in @command{gawk} 3.1.2.
|
|
@code{asorti} is a @command{gawk} extension; it is not available
|
|
in compatibility mode (@pxref{Options}).
|
|
|
|
@item index(@var{in}, @var{find})
|
|
@cindex @code{index} function
|
|
@cindex searching
|
|
This searches the string @var{in} for the first occurrence of the string
|
|
@var{find}, and returns the position in characters where that occurrence
|
|
begins in the string @var{in}. Consider the following example:
|
|
|
|
@example
|
|
$ awk 'BEGIN @{ print index("peanut", "an") @}'
|
|
@print{} 3
|
|
@end example
|
|
|
|
@noindent
|
|
If @var{find} is not found, @code{index} returns zero.
|
|
(Remember that string indices in @command{awk} start at one.)
|
|
|
|
@item length(@r{[}@var{string}@r{]})
|
|
@cindex @code{length} function
|
|
This returns the number of characters in @var{string}. If
|
|
@var{string} is a number, the length of the digit string representing
|
|
that number is returned. For example, @code{length("abcde")} is 5. By
|
|
contrast, @code{length(15 * 35)} works out to 3. In this example, 15 * 35 =
|
|
525, and 525 is then converted to the string @code{"525"}, which has
|
|
three characters.
|
|
|
|
If no argument is supplied, @code{length} returns the length of @code{$0}.
|
|
|
|
@c @cindex historical features
|
|
@cindex portability, @code{length} function
|
|
@cindex POSIX @command{awk}, functions and, @code{length}
|
|
@strong{Note:}
|
|
In older versions of @command{awk}, the @code{length} function could
|
|
be called
|
|
without any parentheses. Doing so is marked as ``deprecated'' in the
|
|
POSIX standard. This means that while a program can do this,
|
|
it is a feature that can eventually be removed from a future
|
|
version of the standard. Therefore, for programs to be maximally portable,
|
|
always supply the parentheses.
|
|
|
|
@item match(@var{string}, @var{regexp} @r{[}, @var{array}@r{]})
|
|
@cindex @code{match} function
|
|
The @code{match} function searches @var{string} for the
|
|
longest, leftmost substring matched by the regular expression,
|
|
@var{regexp}. It returns the character position, or @dfn{index},
|
|
at which that substring begins (one, if it starts at the beginning of
|
|
@var{string}). If no match is found, it returns zero.
|
|
|
|
The @var{regexp} argument may be either a regexp constant
|
|
(@samp{/@dots{}/}) or a string constant (@code{"@dots{}"}).
|
|
In the latter case, the string is treated as a regexp to be matched.
|
|
@xref{Computed Regexps}, for a
|
|
discussion of the difference between the two forms, and the
|
|
implications for writing your program correctly.
|
|
|
|
The order of the first two arguments is backwards from most other string
|
|
functions that work with regular expressions, such as
|
|
@code{sub} and @code{gsub}. It might help to remember that
|
|
for @code{match}, the order is the same as for the @samp{~} operator:
|
|
@samp{@var{string} ~ @var{regexp}}.
|
|
|
|
@cindex @code{RSTART} variable, @code{match} function and
|
|
@cindex @code{RLENGTH} variable, @code{match} function and
|
|
@cindex @code{match} function, @code{RSTART}/@code{RLENGTH} variables
|
|
The @code{match} function sets the built-in variable @code{RSTART} to
|
|
the index. It also sets the built-in variable @code{RLENGTH} to the
|
|
length in characters of the matched substring. If no match is found,
|
|
@code{RSTART} is set to zero, and @code{RLENGTH} to @minus{}1.
|
|
|
|
For example:
|
|
|
|
@example
|
|
@c file eg/misc/findpat.awk
|
|
@{
|
|
if ($1 == "FIND")
|
|
regex = $2
|
|
else @{
|
|
where = match($0, regex)
|
|
if (where != 0)
|
|
print "Match of", regex, "found at",
|
|
where, "in", $0
|
|
@}
|
|
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
@noindent
|
|
This program looks for lines that match the regular expression stored in
|
|
the variable @code{regex}. This regular expression can be changed. If the
|
|
first word on a line is @samp{FIND}, @code{regex} is changed to be the
|
|
second word on that line. Therefore, if given:
|
|
|
|
@example
|
|
@c file eg/misc/findpat.data
|
|
FIND ru+n
|
|
My program runs
|
|
but not very quickly
|
|
FIND Melvin
|
|
JF+KM
|
|
This line is property of Reality Engineering Co.
|
|
Melvin was here.
|
|
@c endfile
|
|
@end example
|
|
|
|
@noindent
|
|
@command{awk} prints:
|
|
|
|
@example
|
|
Match of ru+n found at 12 in My program runs
|
|
Match of Melvin found at 1 in Melvin was here.
|
|
@end example
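
@code{RSTART} and @code{RLENGTH} are often combined with the
@code{substr} function to extract the matched text itself. The following
sketch prints the first run of digits on each input line that has one;
the regexp is only an illustration:

@example
@{
    if (match($0, /[0-9]+/))
        print substr($0, RSTART, RLENGTH)
@}
@end example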
|
|
|
|
@cindex differences in @command{awk} and @command{gawk}, @code{match} function
|
|
If @var{array} is present, it is cleared, and then the 0th element
|
|
of @var{array} is set to the entire portion of @var{string}
|
|
matched by @var{regexp}. If @var{regexp} contains parentheses,
|
|
the integer-indexed elements of @var{array} are set to contain the
|
|
portion of @var{string} matching the corresponding parenthesized
|
|
subexpression.
|
|
For example:
|
|
|
|
@example
|
|
$ echo foooobazbarrrrr |
|
|
> gawk '@{ match($0, /(fo+).+(bar*)/, arr)
|
|
> print arr[1], arr[2] @}'
|
|
@print{} foooo barrrrr
|
|
@end example
|
|
|
|
In addition,
|
|
beginning with @command{gawk} 3.1.2,
|
|
multidimensional subscripts are available providing
|
|
the start index and length of each matched subexpression:
|
|
|
|
@example
|
|
$ echo foooobazbarrrrr |
|
|
> gawk '@{ match($0, /(fo+).+(bar*)/, arr)
|
|
> print arr[1], arr[2]
|
|
> print arr[1, "start"], arr[1, "length"]
|
|
> print arr[2, "start"], arr[2, "length"]
|
|
> @}'
|
|
@print{} foooo barrrrr
|
|
@print{} 1 5
|
|
@print{} 9 7
|
|
@end example
|
|
|
|
There may not be subscripts for the start and length for every parenthesized
subexpression, since they may not all have matched text; thus they
|
|
should be tested for with the @code{in} operator
|
|
(@pxref{Reference to Elements}).
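
A sketch of such a test follows. The input and the regexp are chosen,
purely for illustration, so that only the first alternative can match:

@example
$ echo foobar |
> gawk '@{ match($0, /(fo+)|(xyz)/, arr)
>          if ((2, "start") in arr)
>              print "group 2 starts at", arr[2, "start"]
>          else
>              print "group 2 did not match"
>        @}'
@end example

@noindent
Here the second parenthesized subexpression takes no part in the match,
so the program should report that group 2 did not match.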
|
|
|
|
@cindex troubleshooting, @code{match} function
|
|
The @var{array} argument to @code{match} is a
|
|
@command{gawk} extension. In compatibility mode
|
|
(@pxref{Options}),
|
|
using a third argument is a fatal error.
|
|
|
|
@item split(@var{string}, @var{array} @r{[}, @var{fieldsep}@r{]})
|
|
@cindex @code{split} function
|
|
This function divides @var{string} into pieces separated by @var{fieldsep}
|
|
and stores the pieces in @var{array}. The first piece is stored in
|
|
@code{@var{array}[1]}, the second piece in @code{@var{array}[2]}, and so
|
|
forth. The string value of the third argument, @var{fieldsep}, is
|
|
a regexp describing where to split @var{string} (much as @code{FS} can
|
|
be a regexp describing where to split input records). If
|
|
@var{fieldsep} is omitted, the value of @code{FS} is used.
|
|
@code{split} returns the number of elements created.
|
|
|
|
The @code{split} function splits strings into pieces in a
|
|
manner similar to the way input lines are split into fields. For example:
|
|
|
|
@example
|
|
split("cul-de-sac", a, "-")
|
|
@end example
|
|
|
|
@noindent
|
|
@cindex strings, splitting
|
|
splits the string @samp{cul-de-sac} into three fields using @samp{-} as the
|
|
separator. It sets the contents of the array @code{a} as follows:
|
|
|
|
@example
|
|
a[1] = "cul"
|
|
a[2] = "de"
|
|
a[3] = "sac"
|
|
@end example
|
|
|
|
@noindent
|
|
The value returned by this call to @code{split} is three.
|
|
|
|
@cindex differences in @command{awk} and @command{gawk}, @code{split} function
|
|
As with input field-splitting, when the value of @var{fieldsep} is
|
|
@w{@code{" "}}, leading and trailing whitespace is ignored, and the elements
|
|
are separated by runs of whitespace.
|
|
Also as with input field-splitting, if @var{fieldsep} is the null string, each
|
|
individual character in the string is split into its own array element.
|
|
(This is a @command{gawk}-specific extension.)
|
|
|
|
Note, however, that @code{RS} has no effect on the way @code{split}
|
|
works. Even though @samp{RS = ""} causes newline to also be an input
|
|
field separator, this does not affect how @code{split} splits strings.
|
|
|
|
@cindex dark corner, @code{split} function
|
|
Modern implementations of @command{awk}, including @command{gawk}, allow
|
|
the third argument to be a regexp constant (@code{/abc/}) as well as a
|
|
string.
|
|
@value{DARKCORNER}
|
|
The POSIX standard allows this as well.
|
|
@xref{Computed Regexps}, for a
|
|
discussion of the difference between using a string constant or a regexp constant,
|
|
and the implications for writing your program correctly.
|
|
|
|
Before splitting the string, @code{split} deletes any previously existing
|
|
elements in the array @var{array}.
|
|
|
|
If @var{string} is null, the array has no elements. (So this is a portable
|
|
way to delete an entire array with one statement.
|
|
@xref{Delete}.)
|
|
|
|
If @var{string} does not match @var{fieldsep} at all (but is not null),
|
|
@var{array} has one element only. The value of that element is the original
|
|
@var{string}.
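
The special cases described above can be checked with a short sketch
such as the following (the array name @code{parts} is arbitrary):

@example
BEGIN @{
    n = split("  a  b   c ", parts, " ")
    print n, parts[1], parts[3]     # 3 a c

    n = split("abc", parts, "-")    # separator never matches
    print n, parts[1]               # 1 abc

    n = split("", parts)            # null string: array is emptied
    print n                         # 0
@}
@end example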
|
|
|
|
@item sprintf(@var{format}, @var{expression1}, @dots{})
|
|
@cindex @code{sprintf} function
|
|
This returns (without printing) the string that @code{printf} would
|
|
have printed out with the same arguments
|
|
(@pxref{Printf}).
|
|
For example:
|
|
|
|
@example
|
|
pival = sprintf("pi = %.2f (approx.)", 22/7)
|
|
@end example
|
|
|
|
@noindent
|
|
assigns the string @w{@code{"pi = 3.14 (approx.)"}} to the variable @code{pival}.
|
|
|
|
@cindex differences in @command{awk} and @command{gawk}, @code{strtonum} function (@command{gawk})
|
|
@cindex @code{strtonum} function (@command{gawk})
|
|
@item strtonum(@var{str}) #
|
|
Examines @var{str} and returns its numeric value. If @var{str}
|
|
begins with a leading @samp{0}, @code{strtonum} assumes that @var{str}
|
|
is an octal number. If @var{str} begins with a leading @samp{0x} or
|
|
@samp{0X}, @code{strtonum} assumes that @var{str} is a hexadecimal number.
|
|
For example:
|
|
|
|
@example
|
|
$ echo 0x11 |
|
|
> gawk '@{ printf "%d\n", strtonum($1) @}'
|
|
@print{} 17
|
|
@end example
|
|
|
|
Using the @code{strtonum} function is @emph{not} the same as adding zero
|
|
to a string value; the automatic coercion of strings to numbers
|
|
works only for decimal data, not for octal or hexadecimal.@footnote{Unless
|
|
you use the @option{--non-decimal-data} option, which isn't recommended.
|
|
@xref{Nondecimal Data}, for more information.}
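
The difference can be seen in a short session such as the following,
which assumes @command{gawk} in its default mode (that is, without
@option{--non-decimal-data}):

@example
$ gawk 'BEGIN @{ print strtonum("0x11"), strtonum("011"), strtonum("11") @}'
@print{} 17 9 11
$ gawk 'BEGIN @{ print "0x11" + 0, "011" + 0 @}'
@print{} 0 11
@end example

@noindent
In the second command, the automatic conversion stops at the @samp{x} in
@code{"0x11"} and treats @code{"011"} as the decimal number 11.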
|
|
|
|
@cindex differences in @command{awk} and @command{gawk}, @code{strtonum} function (@command{gawk})
|
|
@code{strtonum} is a @command{gawk} extension; it is not available
|
|
in compatibility mode (@pxref{Options}).
|
|
|
|
@item sub(@var{regexp}, @var{replacement} @r{[}, @var{target}@r{]})
|
|
@cindex @code{sub} function
|
|
The @code{sub} function alters the value of @var{target}.
|
|
It searches this value, which is treated as a string, for the
|
|
leftmost, longest substring matched by the regular expression @var{regexp}.
|
|
Then the entire string is
|
|
changed by replacing the matched text with @var{replacement}.
|
|
The modified string becomes the new value of @var{target}.
|
|
|
|
The @var{regexp} argument may be either a regexp constant
|
|
(@samp{/@dots{}/}) or a string constant (@code{"@dots{}"}).
|
|
In the latter case, the string is treated as a regexp to be matched.
|
|
@xref{Computed Regexps}, for a
|
|
discussion of the difference between the two forms, and the
|
|
implications for writing your program correctly.
|
|
|
|
This function is peculiar because @var{target} is not simply
|
|
used to compute a value, and not just any expression will do---it
|
|
must be a variable, field, or array element so that @code{sub} can
|
|
store a modified value there. If this argument is omitted, then the
|
|
default is to use and alter @code{$0}.@footnote{Note that this means
|
|
that the record will first be regenerated using the value of @code{OFS} if
|
|
any fields have been changed, and that the fields will be updated
|
|
after the substituion, even if the operation is a ``no-op'' such
|
|
as @samp{sub(/^/, "")}.}
|
|
For example:
|
|
|
|
@example
|
|
str = "water, water, everywhere"
|
|
sub(/at/, "ith", str)
|
|
@end example
|
|
|
|
@noindent
|
|
sets @code{str} to @w{@code{"wither, water, everywhere"}}, by replacing the
|
|
leftmost longest occurrence of @samp{at} with @samp{ith}.
|
|
|
|
The @code{sub} function returns the number of substitutions made (either
|
|
one or zero).
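
The return value makes it easy to count how many records were changed.
In the following sketch, the variable @code{changed} and the two
spellings are made up for the example:

@example
@{
    if (sub(/colour/, "color"))
        changed++
    print
@}

END @{ printf("%d line(s) changed\n", changed) @}
@end example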
|
|
|
|
If the special character @samp{&} appears in @var{replacement}, it
|
|
stands for the precise substring that was matched by @var{regexp}. (If
|
|
the regexp can match more than one string, then this precise substring
|
|
may vary.) For example:
|
|
|
|
@example
|
|
@{ sub(/candidate/, "& and his wife"); print @}
|
|
@end example
|
|
|
|
@noindent
|
|
changes the first occurrence of @samp{candidate} to @samp{candidate
|
|
and his wife} on each input line.
|
|
Here is another example:
|
|
|
|
@example
|
|
$ awk 'BEGIN @{
|
|
> str = "daabaaa"
|
|
> sub(/a+/, "C&C", str)
|
|
> print str
|
|
> @}'
|
|
@print{} dCaaCbaaa
|
|
@end example
|
|
|
|
@noindent
|
|
This shows how @samp{&} can represent a nonconstant string and also
|
|
illustrates the ``leftmost, longest'' rule in regexp matching
|
|
(@pxref{Leftmost Longest}).
|
|
|
|
The effect of this special character (@samp{&}) can be turned off by putting a
|
|
backslash before it in the string. As usual, to insert one backslash in
|
|
the string, you must write two backslashes. Therefore, write @samp{\\&}
|
|
in a string constant to include a literal @samp{&} in the replacement.
|
|
For example, the following shows how to replace the first @samp{|} on each line with
|
|
an @samp{&}:
|
|
|
|
@example
|
|
@{ sub(/\|/, "\\&"); print @}
|
|
@end example
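
To see the two forms side by side, the following sketch (assuming a POSIX
shell; @code{matched} and @code{literal} are illustrative variable names)
applies both replacements to the same input line:

```shell
# "&" alone inserts the matched text; "\\&" inserts a literal ampersand.
matched=$(echo 'a|b|c' | awk '{ sub(/\|/, "[&]"); print }')
literal=$(echo 'a|b|c' | awk '{ sub(/\|/, "\\&"); print }')
printf '%s\n' "$matched" "$literal"
```

The first command should print @samp{a[|]b|c} and the second @samp{a&b|c}.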
|
|
|
|
@cindex @code{sub} function, arguments of
|
|
@cindex @code{gsub} function, arguments of
|
|
As mentioned, the third argument to @code{sub} must
|
|
be a variable, field or array reference.
|
|
Some versions of @command{awk} allow the third argument to
|
|
be an expression that is not an lvalue. In such a case, @code{sub}
|
|
still searches for the pattern and returns zero or one, but the result of
|
|
the substitution (if any) is thrown away because there is no place
|
|
to put it. Such versions of @command{awk} accept expressions
|
|
such as the following:
|
|
|
|
@example
|
|
sub(/USA/, "United States", "the USA and Canada")
|
|
@end example
|
|
|
|
@noindent
|
|
@cindex troubleshooting, @code{gsub}/@code{sub} functions
|
|
For historical compatibility, @command{gawk} accepts erroneous code,
|
|
such as in the previous example. However, using any other nonchangeable
|
|
object as the third parameter causes a fatal error and your program
|
|
will not run.
|
|
|
|
Finally, if the @var{regexp} is not a regexp constant, it is converted into a
|
|
string, and then the value of that string is treated as the regexp to match.
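
The practical consequence is an extra level of backslashes when the regexp
is held in a string. The following sketch, which assumes a POSIX shell and
uses the illustrative names @code{a}, @code{b}, and @code{re}, performs the
same substitution both ways:

```shell
out=$(echo 'v1.2.3' | awk '{
    a = $0; b = $0
    sub(/\./, "-", a)   # regexp constant: one backslash
    re = "\\."          # string: two backslashes yield the regexp \.
    sub(re, "-", b)     # the string value is used as the regexp
    print a, b
}')
printf '%s\n' "$out"
```

Both results should be @samp{v1-2.3}.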
|
|
|
|
@item gsub(@var{regexp}, @var{replacement} @r{[}, @var{target}@r{]})
|
|
@cindex @code{gsub} function
|
|
This is similar to the @code{sub} function, except @code{gsub} replaces
|
|
@emph{all} of the longest, leftmost, @emph{nonoverlapping} matching
|
|
substrings it can find. The @samp{g} in @code{gsub} stands for
|
|
``global,'' which means replace everywhere. For example:
|
|
|
|
@example
|
|
@{ gsub(/Britain/, "United Kingdom"); print @}
|
|
@end example
|
|
|
|
@noindent
|
|
replaces all occurrences of the string @samp{Britain} with @samp{United
|
|
Kingdom} for all input records.
|
|
|
|
The @code{gsub} function returns the number of substitutions made. If
|
|
the variable to search and alter (@var{target}) is
|
|
omitted, then the entire input record (@code{$0}) is used.
|
|
As in @code{sub}, the characters @samp{&} and @samp{\} are special,
|
|
and the third argument must be assignable.
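
The following sketch shows the return value together with the
nonoverlapping rule. It assumes a POSIX shell; @code{out} and @code{n}
are illustrative names:

```shell
# Five a's hold only two nonoverlapping matches of "aa"; one "a" is left over.
out=$(echo 'aaaaa' | awk '{ n = gsub(/aa/, "X"); print n, $0 }')
printf '%s\n' "$out"
```

This should print @samp{2 XXa}.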
|
|
|
|
@item gensub(@var{regexp}, @var{replacement}, @var{how} @r{[}, @var{target}@r{]}) #
|
|
@cindex @code{gensub} function (@command{gawk})
|
|
@code{gensub} is a general substitution function. Like @code{sub} and
|
|
@code{gsub}, it searches the target string @var{target} for matches of
|
|
the regular expression @var{regexp}. Unlike @code{sub} and @code{gsub},
|
|
the modified string is returned as the result of the function and the
|
|
original target string is @emph{not} changed. If @var{how} is a string
|
|
beginning with @samp{g} or @samp{G}, then it replaces all matches of
|
|
@var{regexp} with @var{replacement}. Otherwise, @var{how} is treated
|
|
as a number that indicates which match of @var{regexp} to replace. If
|
|
no @var{target} is supplied, @code{$0} is used.
|
|
|
|
@code{gensub} provides an additional feature that is not available
|
|
in @code{sub} or @code{gsub}: the ability to specify components of a
|
|
regexp in the replacement text. This is done by using parentheses in
|
|
the regexp to mark the components and then specifying @samp{\@var{N}}
|
|
in the replacement text, where @var{N} is a digit from 1 to 9.
|
|
For example:
|
|
|
|
@example
|
|
$ gawk '
|
|
> BEGIN @{
|
|
> a = "abc def"
|
|
> b = gensub(/(.+) (.+)/, "\\2 \\1", "g", a)
|
|
> print b
|
|
> @}'
|
|
@print{} def abc
|
|
@end example
|
|
|
|
@noindent
|
|
As with @code{sub}, you must type two backslashes in order
|
|
to get one into the string.
|
|
In the replacement text, the sequence @samp{\0} represents the entire
|
|
matched text, as does the character @samp{&}.
|
|
|
|
The following example shows how you can use the third argument to control
|
|
which match of the regexp should be changed:
|
|
|
|
@example
|
|
$ echo a b c a b c |
|
|
> gawk '@{ print gensub(/a/, "AA", 2) @}'
|
|
@print{} a b c AA b c
|
|
@end example
|
|
|
|
In this case, @code{$0} is used as the default target string.
|
|
@code{gensub} returns the new string as its result, which is
|
|
passed directly to @code{print} for printing.
|
|
|
|
@c @cindex automatic warnings
|
|
@c @cindex warnings, automatic
|
|
If the @var{how} argument is a string that does not begin with @samp{g} or
|
|
@samp{G}, or if it is a number that is less than or equal to zero, only one
|
|
substitution is performed. If @var{how} is zero, @command{gawk} issues
|
|
a warning message.
|
|
|
|
If @var{regexp} does not match @var{target}, @code{gensub}'s return value
|
|
is the original unchanged value of @var{target}.
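
The points above can be checked with a short sketch. It assumes that
@command{gawk} is installed and is run from a POSIX shell; the variable
names @code{out} and @code{s} are illustrative only:

```shell
out=$(gawk 'BEGIN {
    s = "a b c a b c"
    print gensub(/a/, "AA", "g", s)   # every match replaced
    print gensub(/a/, "AA", 2, s)     # only the second match replaced
    print gensub(/z/, "AA", "g", s)   # no match: the original value comes back
    print s                           # the target itself is never modified
}')
printf '%s\n' "$out"
```

The last two lines of output should both be the original string.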
|
|
|
|
@code{gensub} is a @command{gawk} extension; it is not available
|
|
in compatibility mode (@pxref{Options}).
|
|
|
|
@item substr(@var{string}, @var{start} @r{[}, @var{length}@r{]})
|
|
@cindex @code{substr} function
|
|
This returns a @var{length}-character-long substring of @var{string},
|
|
starting at character number @var{start}. The first character of a
|
|
string is character number one.@footnote{This is different from
|
|
C and C++, in which the first character is number zero.}
|
|
For example, @code{substr("washington", 5, 3)} returns @code{"ing"}.
|
|
|
|
If @var{length} is not present, this function returns the whole suffix of
|
|
@var{string} that begins at character number @var{start}. For example,
|
|
@code{substr("washington", 5)} returns @code{"ington"}. The whole
|
|
suffix is also returned
|
|
if @var{length} is greater than the number of characters remaining
|
|
in the string, counting from character @var{start}.
|
|
|
|
If @var{start} is less than one, @code{substr} treats it as
|
|
if it were one. (POSIX doesn't specify what to do in this case:
|
|
Unix @command{awk} acts this way, and therefore @command{gawk}
|
|
does too.)
|
|
If @var{start} is greater than the number of characters
|
|
in the string, @code{substr} returns the null string.
|
|
Similarly, if @var{length} is present but less than or equal to zero,
|
|
the null string is returned.
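
The cases described so far can be collected in one sketch. It assumes a
POSIX shell, brackets each result so that null strings are visible, and
deliberately leaves out the @var{start}-less-than-one case, which POSIX
does not specify:

```shell
out=$(awk 'BEGIN {
    s = "washington"
    printf "[%s]", substr(s, 5, 3)     # three characters starting at 5
    printf "[%s]", substr(s, 5)        # no length: the whole suffix
    printf "[%s]", substr(s, 5, 100)   # length too large: also the suffix
    printf "[%s]", substr(s, 11)       # start past the end: null string
    printf "[%s]\n", substr(s, 5, 0)   # length zero: null string
}')
printf '%s\n' "$out"
```

This should print @samp{[ing][ington][ington][][]}.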
|
|
|
|
@cindex troubleshooting, @code{substr} function
|
|
The string returned by @code{substr} @emph{cannot} be
|
|
assigned. Thus, it is a mistake to attempt to change a portion of
|
|
a string, as shown in the following example:
|
|
|
|
@example
|
|
string = "abcdef"
|
|
# try to get "abCDEf", won't work
|
|
substr(string, 3, 3) = "CDE"
|
|
@end example
|
|
|
|
@noindent
|
|
It is also a mistake to use @code{substr} as the third argument
|
|
of @code{sub} or @code{gsub}:
|
|
|
|
@example
|
|
gsub(/xyz/, "pdq", substr($0, 5, 20)) # WRONG
|
|
@end example
|
|
|
|
@cindex portability, @code{substr} function
|
|
(Some commercial versions of @command{awk} do in fact let you use
|
|
@code{substr} this way, but doing so is not portable.)
|
|
|
|
If you need to replace bits and pieces of a string, combine @code{substr}
|
|
with string concatenation, in the following manner:
|
|
|
|
@example
|
|
string = "abcdef"
|
|
@dots{}
|
|
string = substr(string, 1, 2) "CDE" substr(string, 6)
|
|
@end example
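
If this is needed in several places, the concatenation can be wrapped in a
user-defined function. The following is only a sketch: @code{replace_at}
and its parameter names are hypothetical, not part of @command{awk}, and
the code assumes a POSIX shell:

```shell
out=$(awk '
# replace_at() is a hypothetical helper, not a built-in function.
function replace_at(s, start, len, repl) {
    # text before the slice, the new text, then text after the slice
    return substr(s, 1, start - 1) repl substr(s, start + len)
}
BEGIN { print replace_at("abcdef", 3, 3, "CDE") }')
printf '%s\n' "$out"
```

This should print @samp{abCDEf}.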
|
|
|
|
@cindex case sensitivity, converting case
|
|
@cindex converting, case
|
|
@item tolower(@var{string})
|
|
@cindex @code{tolower} function
|
|
This returns a copy of @var{string}, with each uppercase character
|
|
in the string replaced with its corresponding lowercase character.
|
|
Nonalphabetic characters are left unchanged. For example,
|
|
@code{tolower("MiXeD cAsE 123")} returns @code{"mixed case 123"}.
|
|
|
|
@item toupper(@var{string})
|
|
@cindex @code{toupper} function
|
|
This returns a copy of @var{string}, with each lowercase character
|
|
in the string replaced with its corresponding uppercase character.
|
|
Nonalphabetic characters are left unchanged. For example,
|
|
@code{toupper("MiXeD cAsE 123")} returns @code{"MIXED CASE 123"}.
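
A common use of the case-conversion functions is comparing text without
regard to case. The following sketch assumes a POSIX shell and made-up
input lines; it counts the records whose first field is @samp{error:}
in any mixture of case:

```shell
out=$(printf 'Error: disk\nERROR: net\nok: fine\n' |
    awk 'tolower($1) == "error:" { n++ } END { print n + 0 }')
printf '%s\n' "$out"
```

This should print @samp{2}.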
|
|
@end table
|
|
|
|
@node Gory Details
|
|
@subsubsection More About @samp{\} and @samp{&} with @code{sub}, @code{gsub}, and @code{gensub}
|
|
|
|
@cindex escape processing, @code{gsub}/@code{gensub}/@code{sub} functions
|
|
@cindex @code{sub} function, escape processing
|
|
@cindex @code{gsub} function, escape processing
|
|
@cindex @code{gensub} function (@command{gawk}), escape processing
|
|
@cindex @code{\} (backslash), @code{gsub}/@code{gensub}/@code{sub} functions and
|
|
@cindex backslash (@code{\}), @code{gsub}/@code{gensub}/@code{sub} functions and
|
|
@cindex @code{&} (ampersand), @code{gsub}/@code{gensub}/@code{sub} functions and
|
|
@cindex ampersand (@code{&}), @code{gsub}/@code{gensub}/@code{sub} functions and
|
|
When using @code{sub}, @code{gsub}, or @code{gensub}, and trying to get literal
|
|
backslashes and ampersands into the replacement text, you need to remember
|
|
that there are several levels of @dfn{escape processing} going on.
|
|
|
|
First, there is the @dfn{lexical} level, which is when @command{awk} reads
|
|
your program
|
|
and builds an internal copy of it that can be executed.
|
|
Then there is the runtime level, which is when @command{awk} actually scans the
|
|
replacement string to determine what to generate.
|
|
|
|
At both levels, @command{awk} looks for a defined set of characters that
|
|
can come after a backslash. At the lexical level, it looks for the
|
|
escape sequences listed in @ref{Escape Sequences}.
|
|
Thus, for every @samp{\} that @command{awk} processes at the runtime
|
|
level, type two backslashes at the lexical level.
|
|
When a character that is not valid for an escape sequence follows the
|
|
@samp{\}, Unix @command{awk} and @command{gawk} both simply remove the initial
|
|
@samp{\} and put the next character into the string. Thus, for
|
|
example, @code{"a\qb"} is treated as @code{"aqb"}.
|
|
|
|
At the runtime level, the various functions handle sequences of
|
|
@samp{\} and @samp{&} differently. The situation is (sadly) somewhat complex.
|
|
Historically, the @code{sub} and @code{gsub} functions treated the two
|
|
character sequence @samp{\&} specially; this sequence was replaced in
|
|
the generated text with a single @samp{&}. Any other @samp{\} within
|
|
the @var{replacement} string that did not precede an @samp{&} was passed
|
|
through unchanged. To illustrate with a table:
|
|
|
|
@c Thanks to Karl Berry for help with the TeX stuff.
|
|
@tex
|
|
\vbox{\bigskip
|
|
% This table has lots of &'s and \'s, so unspecialize them.
|
|
\catcode`\& = \other \catcode`\\ = \other
|
|
% But then we need character for escape and tab.
|
|
@catcode`! = 4
|
|
@halign{@hfil#!@qquad@hfil#!@qquad#@hfil@cr
|
|
You type!@code{sub} sees!@code{sub} generates@cr
|
|
@hrulefill!@hrulefill!@hrulefill@cr
|
|
@code{\&}! @code{&}!the matched text@cr
|
|
@code{\\&}! @code{\&}!a literal @samp{&}@cr
|
|
@code{\\\&}! @code{\&}!a literal @samp{&}@cr
|
|
@code{\\\\&}! @code{\\&}!a literal @samp{\&}@cr
|
|
@code{\\\\\&}! @code{\\&}!a literal @samp{\&}@cr
|
|
@code{\\\\\\&}! @code{\\\&}!a literal @samp{\\&}@cr
|
|
@code{\\q}! @code{\q}!a literal @samp{\q}@cr
|
|
}
|
|
@bigskip}
|
|
@end tex
|
|
@ifnottex
|
|
@display
|
|
You type @code{sub} sees @code{sub} generates
|
|
-------- ---------- ---------------
|
|
@code{\&} @code{&} the matched text
|
|
@code{\\&} @code{\&} a literal @samp{&}
|
|
@code{\\\&} @code{\&} a literal @samp{&}
|
|
@code{\\\\&} @code{\\&} a literal @samp{\&}
|
|
@code{\\\\\&} @code{\\&} a literal @samp{\&}
|
|
@code{\\\\\\&} @code{\\\&} a literal @samp{\\&}
|
|
@code{\\q} @code{\q} a literal @samp{\q}
|
|
@end display
|
|
@end ifnottex
|
|
|
|
@noindent
|
|
This table shows both the lexical-level processing, where
|
|
an odd number of backslashes becomes an even number at the runtime level,
|
|
as well as the runtime processing done by @code{sub}.
|
|
(For the sake of simplicity, the rest of the following tables only show the
|
|
case of even numbers of backslashes entered at the lexical level.)
|
|
|
|
The problem with the historical approach is that there is no way to get
|
|
a literal @samp{\} followed by the matched text.
|
|
|
|
@c @cindex @command{awk} language, POSIX version
|
|
@cindex POSIX @command{awk}, functions and, @code{gsub}/@code{sub}
|
|
The 1992 POSIX standard attempted to fix this problem. The standard
|
|
says that @code{sub} and @code{gsub} look for either a @samp{\} or an @samp{&}
|
|
after the @samp{\}. If either one follows a @samp{\}, that character is
|
|
output literally. The interpretation of @samp{\} and @samp{&} then becomes:
|
|
|
|
@c thanks to Karl Berry for formatting this table
|
|
@tex
|
|
\vbox{\bigskip
|
|
% This table has lots of &'s and \'s, so unspecialize them.
|
|
\catcode`\& = \other \catcode`\\ = \other
|
|
% But then we need character for escape and tab.
|
|
@catcode`! = 4
|
|
@halign{@hfil#!@qquad@hfil#!@qquad#@hfil@cr
|
|
You type!@code{sub} sees!@code{sub} generates@cr
|
|
@hrulefill!@hrulefill!@hrulefill@cr
|
|
@code{&}! @code{&}!the matched text@cr
|
|
@code{\\&}! @code{\&}!a literal @samp{&}@cr
|
|
@code{\\\\&}! @code{\\&}!a literal @samp{\}, then the matched text@cr
|
|
@code{\\\\\\&}! @code{\\\&}!a literal @samp{\&}@cr
|
|
}
|
|
@bigskip}
|
|
@end tex
|
|
@ifnottex
|
|
@display
|
|
You type @code{sub} sees @code{sub} generates
|
|
-------- ---------- ---------------
|
|
@code{&} @code{&} the matched text
|
|
@code{\\&} @code{\&} a literal @samp{&}
|
|
@code{\\\\&} @code{\\&} a literal @samp{\}, then the matched text
|
|
@code{\\\\\\&} @code{\\\&} a literal @samp{\&}
|
|
@end display
|
|
@end ifnottex
|
|
|
|
@noindent
|
|
This appears to solve the problem.
|
|
Unfortunately, the phrasing of the standard is unusual. It
|
|
says, in effect, that @samp{\} turns off the special meaning of any
|
|
following character, but for anything other than @samp{\} and @samp{&},
|
|
such special meaning is undefined. This wording leads to two problems:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
Backslashes must now be doubled in the @var{replacement} string, breaking
|
|
historical @command{awk} programs.
|
|
|
|
@item
|
|
To make sure that an @command{awk} program is portable, @emph{every} character
|
|
in the @var{replacement} string must be preceded with a
|
|
backslash.@footnote{This consequence was certainly unintended.}
|
|
@c I can say that, 'cause I was involved in making this change
|
|
@end itemize
|
|
|
|
The POSIX standard is under revision.
|
|
Because of the problems just listed, proposed text for the revised standard
|
|
reverts to rules that correspond more closely to the original existing
|
|
practice. The proposed rules have special cases that make it possible
|
|
to produce a @samp{\} preceding the matched text:
|
|
|
|
@tex
|
|
\vbox{\bigskip
|
|
% This table has lots of &'s and \'s, so unspecialize them.
|
|
\catcode`\& = \other \catcode`\\ = \other
|
|
% But then we need character for escape and tab.
|
|
@catcode`! = 4
|
|
@halign{@hfil#!@qquad@hfil#!@qquad#@hfil@cr
|
|
You type!@code{sub} sees!@code{sub} generates@cr
|
|
@hrulefill!@hrulefill!@hrulefill@cr
|
|
@code{\\\\\\&}! @code{\\\&}!a literal @samp{\&}@cr
|
|
@code{\\\\&}! @code{\\&}!a literal @samp{\}, followed by the matched text@cr
|
|
@code{\\&}! @code{\&}!a literal @samp{&}@cr
|
|
@code{\\q}! @code{\q}!a literal @samp{\q}@cr
|
|
}
|
|
@bigskip}
|
|
@end tex
|
|
@ifinfo
|
|
@display
|
|
You type @code{sub} sees @code{sub} generates
|
|
-------- ---------- ---------------
|
|
@code{\\\\\\&} @code{\\\&} a literal @samp{\&}
|
|
@code{\\\\&} @code{\\&} a literal @samp{\}, followed by the matched text
|
|
@code{\\&} @code{\&} a literal @samp{&}
|
|
@code{\\q} @code{\q} a literal @samp{\q}
|
|
@end display
|
|
@end ifinfo
|
|
|
|
In a nutshell, at the runtime level, there are now three special sequences
|
|
of characters (@samp{\\\&}, @samp{\\&} and @samp{\&}) whereas historically
|
|
there was only one. However, as in the historical case, any @samp{\} that
|
|
is not part of one of these three sequences is not special and appears
|
|
in the output literally.
|
|
|
|
@command{gawk} 3.0 and 3.1 follow these proposed POSIX rules for @code{sub} and
|
|
@code{gsub}.
|
|
@c As much as we think it's a lousy idea. You win some, you lose some. Sigh.
|
|
Whether these proposed rules will actually become codified into the
|
|
standard is unknown at this point. Subsequent @command{gawk} releases will
|
|
track the standard and implement whatever the final version specifies;
|
|
this @value{DOCUMENT} will be updated as
|
|
well.@footnote{As this @value{DOCUMENT} was being finalized,
|
|
we learned that the POSIX standard will not use these rules.
|
|
However, it was too late to change @command{gawk} for the 3.1 release.
|
|
@command{gawk} behaves as described here.}
|
|
|
|
The rules for @code{gensub} are considerably simpler. At the runtime
|
|
level, whenever @command{gawk} sees a @samp{\}, if the following character
|
|
is a digit, then the text that matched the corresponding parenthesized
|
|
subexpression is placed in the generated output. Otherwise,
|
|
no matter what character follows the @samp{\}, it
|
|
appears in the generated text and the @samp{\} does not:
|
|
|
|
@tex
|
|
\vbox{\bigskip
|
|
% This table has lots of &'s and \'s, so unspecialize them.
|
|
\catcode`\& = \other \catcode`\\ = \other
|
|
% But then we need character for escape and tab.
|
|
@catcode`! = 4
|
|
@halign{@hfil#!@qquad@hfil#!@qquad#@hfil@cr
|
|
You type!@code{gensub} sees!@code{gensub} generates@cr
|
|
@hrulefill!@hrulefill!@hrulefill@cr
|
|
@code{&}! @code{&}!the matched text@cr
|
|
@code{\\&}! @code{\&}!a literal @samp{&}@cr
|
|
@code{\\\\}! @code{\\}!a literal @samp{\}@cr
|
|
@code{\\\\&}! @code{\\&}!a literal @samp{\}, then the matched text@cr
|
|
@code{\\\\\\&}! @code{\\\&}!a literal @samp{\&}@cr
|
|
@code{\\q}! @code{\q}!a literal @samp{q}@cr
|
|
}
|
|
@bigskip}
|
|
@end tex
|
|
@ifnottex
|
|
@display
|
|
You type @code{gensub} sees @code{gensub} generates
|
|
-------- ------------- ------------------
|
|
@code{&} @code{&} the matched text
|
|
@code{\\&} @code{\&} a literal @samp{&}
|
|
@code{\\\\} @code{\\} a literal @samp{\}
|
|
@code{\\\\&} @code{\\&} a literal @samp{\}, then the matched text
|
|
@code{\\\\\\&} @code{\\\&} a literal @samp{\&}
|
|
@code{\\q} @code{\q} a literal @samp{q}
|
|
@end display
|
|
@end ifnottex
|
|
|
|
Because of the complexity of the lexical and runtime level processing
|
|
and the special cases for @code{sub} and @code{gsub},
|
|
we recommend the use of @command{gawk} and @code{gensub} when you have
|
|
to do substitutions.
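
The @code{gensub} rules can be tried out with a short sketch. It assumes
that @command{gawk} is installed and is run from a POSIX shell, with the
program in single quotes so that the shell leaves the backslashes alone:

```shell
out=$(gawk 'BEGIN {
    print gensub(/b/, "[&]", "g", "abc")      # & is the matched text
    print gensub(/b/, "\\&", "g", "abc")      # \& is a literal ampersand
    print gensub(/b/, "\\\\&", "g", "abc")    # \\& is a backslash, then the match
}')
printf '%s\n' "$out"
```

The three lines of output should be @samp{a[b]c}, @samp{a&c}, and
@samp{a\bc}.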
|
|
|
|
@c fakenode --- for prepinfo
|
|
@subheading Advanced Notes: Matching the Null String
|
|
@c last comma does NOT start tertiary
|
|
@cindex advanced features, null strings, matching
|
|
@cindex matching, null strings
|
|
@cindex null strings, matching
|
|
@c last comma in next two is part of tertiary
|
|
@cindex @code{*} (asterisk), @code{*} operator, null strings, matching
|
|
@cindex asterisk (@code{*}), @code{*} operator, null strings, matching
|
|
|
|
In @command{awk}, the @samp{*} operator can match the null string.
|
|
This is particularly important for the @code{sub}, @code{gsub},
|
|
and @code{gensub} functions. For example:
|
|
|
|
@example
|
|
$ echo abc | awk '@{ gsub(/m*/, "X"); print @}'
|
|
@print{} XaXbXcX
|
|
@end example
|
|
|
|
@noindent
|
|
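The regexp @samp{m*} matches the null string before each character and at the
end of the record, so an @samp{X} is inserted at every one of those positions.
Although this makes a certain amount of sense, it can be surprising.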
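
When only real occurrences should be replaced, a regexp that cannot match
the null string avoids this result. The following sketch assumes a POSIX
shell and uses the illustrative variable names @code{star} and @code{plus}
to compare @samp{*} with @samp{+}:

```shell
# m* can match the null string, so a replacement is made at every position.
star=$(echo abc | awk '{ gsub(/m*/, "X"); print }')
# m+ needs at least one real m, so nothing is replaced and gsub returns 0.
plus=$(echo abc | awk '{ n = gsub(/m+/, "X"); print n, $0 }')
printf '%s\n' "$star" "$plus"
```

With @samp{+}, the record is left as @samp{abc} and the count is zero.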
|
|
|
|
@node I/O Functions
|
|
@subsection Input/Output Functions
|
|
|
|
The following functions relate to input/output (I/O).
|
|
Optional parameters are enclosed in square brackets ([ ]):
|
|
|
|
@table @code
|
|
@item close(@var{filename} @r{[}, @var{how}@r{]})
|
|
@cindex @code{close} function
|
|
@cindex files, closing
|
|
Close the file @var{filename} for input or output. Alternatively, the
|
|
argument may be a shell command that was used for creating a coprocess, or
|
|
for redirecting to or from a pipe; then the coprocess or pipe is closed.
|
|
@xref{Close Files And Pipes},
|
|
for more information.
|
|
|
|
When closing a coprocess, it is occasionally useful to first close
|
|
one end of the two-way pipe and then to close the other. This is done
|
|
by providing a second argument to @code{close}. This second argument
|
|
should be one of the two string values @code{"to"} or @code{"from"},
|
|
indicating which end of the pipe to close. Case in the string does
|
|
not matter.
|
|
@xref{Two-way I/O},
|
|
which discusses this feature in more detail and gives an example.
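
As a brief sketch of the idea, the following program sends two lines to
@command{sort} running as a coprocess, closes the @code{"to"} end so that
@command{sort} sees end-of-file, and then reads back the result. It assumes
that @command{gawk} and @command{sort} are installed; the sample data is
made up:

```shell
out=$(gawk 'BEGIN {
    cmd = "sort"
    print "pear"  |& cmd
    print "apple" |& cmd
    close(cmd, "to")                   # sort now sees end-of-file on its input
    while ((cmd |& getline line) > 0)  # read the sorted result back
        print line
    close(cmd)                         # close the remaining end
}')
printf '%s\n' "$out"
```

Without the first @code{close}, @command{sort} would keep waiting for more
input and the @code{getline} loop would never receive anything.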
|
|
|
|
@item fflush(@r{[}@var{filename}@r{]})
|
|
@cindex @code{fflush} function
|
|
Flush any buffered output associated with @var{filename}, which is either a
|
|
file opened for writing or a shell command for redirecting output to
|
|
a pipe or coprocess.
|
|
|
|
@cindex portability, @code{fflush} function and
|
|
@cindex buffers, flushing
|
|
@cindex output, buffering
|
|
Many utility programs @dfn{buffer} their output; i.e., they save information
|
|
to write to a disk file or terminal in memory until there is enough
|
|
for it to be worthwhile to send the data to the output device.
|
|
This is often more efficient than writing
|
|
every little bit of information as soon as it is ready. However, sometimes
|
|
it is necessary to force a program to @dfn{flush} its buffers; that is,
|
|
write the information to its destination, even if a buffer is not full.
|
|
This is the purpose of the @code{fflush} function---@command{gawk} also
|
|
buffers its output and the @code{fflush} function forces
|
|
@command{gawk} to flush its buffers.
|
|
|
|
@code{fflush} was added to the Bell Laboratories research
|
|
version of @command{awk} in 1994; it is not part of the POSIX standard and is
|
|
not available if @option{--posix} has been specified on the
|
|
command line (@pxref{Options}).
|
|
|
|
@cindex @command{gawk}, @code{fflush} function in
|
|
@command{gawk} extends the @code{fflush} function in two ways. The first
|
|
is to allow no argument at all. In this case, the buffer for the
|
|
standard output is flushed. The second is to allow the null string
|
|
(@w{@code{""}}) as the argument. In this case, the buffers for
|
|
@emph{all} open output files and pipes are flushed.
|
|
|
|
@c @cindex automatic warnings
|
|
@c @cindex warnings, automatic
|
|
@cindex troubleshooting, @code{fflush} function
|
|
@code{fflush} returns zero if the buffer is successfully flushed;
|
|
otherwise, it returns @minus{}1.
|
|
In the case where all buffers are flushed, the return value is zero
|
|
only if all buffers were flushed successfully. Otherwise, it is
|
|
@minus{}1, and @command{gawk} warns about the @var{filename} that could not be flushed.
|
|
|
|
@command{gawk} also issues a warning message if you attempt to flush
|
|
a file or pipe that was opened for reading (such as with @code{getline}),
|
|
or if @var{filename} is not an open file, pipe, or coprocess.
|
|
In such a case, @code{fflush} returns @minus{}1, as well.
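
The following sketch flushes a file that is open for writing and checks the
return value. It rests on several assumptions: a POSIX shell, an
@command{awk} that provides @code{fflush}, the @command{mktemp} and
@command{cat} utilities, and a temporary file name without spaces. The
names @code{tmp}, @code{f}, and @code{rc} are illustrative only:

```shell
tmp=$(mktemp)
out=$(awk -v f="$tmp" 'BEGIN {
    print "buffered line" > f
    rc = fflush(f)        # push the line out to the file right away
    system("cat " f)      # another program can already read it
    print "rc=" rc        # zero means the flush succeeded
    close(f)
}')
rm -f "$tmp"
printf '%s\n' "$out"
```

The output should be the buffered line followed by @samp{rc=0}.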
|
|
|
|
@item system(@var{command})
|
|
@cindex @code{system} function
|
|
@cindex interacting with other programs
|
|
Executes operating-system
|
|
commands and then returns to the @command{awk} program. The @code{system}
|
|
function executes the command given by the string @var{command}.
|
|
It returns the exit status of the executed command as its value.
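
A minimal sketch of using the return value follows. It assumes a POSIX
shell both outside and inside @code{system}; @code{ok} and @code{bad} are
illustrative names:

```shell
out=$(awk 'BEGIN {
    ok = system("true")       # a command that succeeds
    bad = system("exit 3")    # the shell exits with status 3
    print ok, bad
}')
printf '%s\n' "$out"
```

This should print @samp{0 3}.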
|
|
|
|
For example, if the following fragment of code is put in your @command{awk}
|
|
program:
|
|
|
|
@example
|
|
END @{
|
|
system("date | mail -s 'awk run done' root")
|
|
@}
|
|
@end example
|
|
|
|
@noindent
|
|
the system administrator is sent mail when the @command{awk} program
|
|
finishes processing input and begins its end-of-input processing.
|
|
|
|
Note that redirecting @code{print} or @code{printf} into a pipe is often
|
|
enough to accomplish your task. If you need to run many commands, it
|
|
is more efficient to simply print them down a pipeline to the shell:
|
|
|
|
@example
|
|
while (@var{more stuff to do})
|
|
print @var{command} | "/bin/sh"
|
|
close("/bin/sh")
|
|
@end example
|
|
|
|
@noindent
|
|
@cindex troubleshooting, @code{system} function
|
|
However, if your @command{awk}
|
|
program is interactive, @code{system} is useful for cranking up large
|
|
self-contained programs, such as a shell or an editor.
|
|
Some operating systems cannot implement the @code{system} function.
|
|
@code{system} causes a fatal error if it is not supported.
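
The technique of printing commands down a pipeline to the shell can be made
concrete with a small sketch. It assumes that @file{/bin/sh} exists; the
three @command{echo} commands merely stand in for real work:

```shell
out=$(awk 'BEGIN {
    for (i = 1; i <= 3; i++)
        print "echo job " i | "/bin/sh"   # queue a command for the shell
    close("/bin/sh")                      # wait for the shell to finish
}')
printf '%s\n' "$out"
```

Only one shell is started for all three commands, whereas three calls to
@code{system} would typically start three.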
|
|
@end table
|
|
|
|
@c fakenode --- for prepinfo
|
|
@subheading Advanced Notes: Interactive Versus Noninteractive Buffering
|
|
@cindex advanced features, buffering
|
|
@cindex buffering, interactive vs. noninteractive
|
|
|
|
As a side point, buffering issues can be even more confusing, depending
|
|
upon whether your program is @dfn{interactive}, i.e., communicating
|
|
with a user sitting at a keyboard.@footnote{A program is interactive
|
|
if the standard output is connected
|
|
to a terminal device.}
|
|
|
|
@c Thanks to Walter.Mecky@dresdnerbank.de for this example, and for
|
|
@c motivating me to write this section.
|
|
Interactive programs generally @dfn{line buffer} their output; i.e., they
|
|
write out every line. Noninteractive programs wait until they have
|
|
a full buffer, which may be many lines of output.
|
|
Here is an example of the difference:
|
|
|
|
@example
|
|
$ awk '@{ print $1 + $2 @}'
|
|
1 1
|
|
@print{} 2
|
|
2 3
|
|
@print{} 5
|
|
@kbd{@value{CTL}-d}
|
|
@end example
|
|
|
|
@noindent
|
|
Each line of output is printed immediately. Compare that behavior
|
|
with this example:
|
|
|
|
@example
|
|
$ awk '@{ print $1 + $2 @}' | cat
|
|
1 1
|
|
2 3
|
|
@kbd{@value{CTL}-d}
|
|
@print{} 2
|
|
@print{} 5
|
|
@end example
|
|
|
|
@noindent
|
|
Here, no output is printed until after the @kbd{@value{CTL}-d} is typed, because
|
|
it is all buffered and sent down the pipe to @command{cat} in one shot.
|
|
|
|
@c fakenode --- for prepinfo
|
|
@subheading Advanced Notes: Controlling Output Buffering with @code{system}
|
|
@cindex advanced features, buffering
|
|
@cindex buffers, flushing
|
|
@cindex buffering, input/output
|
|
@cindex output, buffering
|
|
|
|
The @code{fflush} function provides explicit control over output buffering for
|
|
individual files and pipes. However, its use is not portable to many other
|
|
@command{awk} implementations. An alternative method to flush output
|
|
buffers is to call @code{system} with a null string as its argument:
|
|
|
|
@example
|
|
system("") # flush output
|
|
@end example
|
|
|
|
@noindent
|
|
@command{gawk} treats this use of the @code{system} function as a special
|
|
case and is smart enough not to run a shell (or other command
|
|
interpreter) with the empty command. Therefore, with @command{gawk}, this
|
|
idiom is not only useful, it is also efficient. While this method should work
|
|
with other @command{awk} implementations, it does not necessarily avoid
|
|
starting an unnecessary shell. (Other implementations may only
|
|
flush the buffer associated with the standard output and not necessarily
|
|
all buffered output.)
|
|
|
|
If you think about what a programmer expects, it makes sense that
|
|
@code{system} should flush any pending output. The following program:
|
|
|
|
@example
|
|
BEGIN @{
|
|
print "first print"
|
|
system("echo system echo")
|
|
print "second print"
|
|
@}
|
|
@end example
|
|
|
|
@noindent
|
|
must print:
|
|
|
|
@example
|
|
first print
|
|
system echo
|
|
second print
|
|
@end example
|
|
|
|
@noindent
|
|
and not:
|
|
|
|
@example
|
|
system echo
|
|
first print
|
|
second print
|
|
@end example
|
|
|
|
If @command{awk} did not flush its buffers before calling @code{system},
|
|
you would see the latter (undesirable) output.
|
|
|
|
@node Time Functions
|
|
@subsection Using @command{gawk}'s Timestamp Functions
|
|
|
|
@c STARTOFRANGE tst
|
|
@cindex timestamps
|
|
@c STARTOFRANGE logftst
|
|
@cindex log files, timestamps in
|
|
@c last comma does NOT start tertiary
|
|
@c STARTOFRANGE filogtst
|
|
@cindex files, log, timestamps in
|
|
@c STARTOFRANGE gawtst
|
|
@cindex @command{gawk}, timestamps
|
|
@cindex POSIX @command{awk}, timestamps and
|
|
@command{awk} programs are commonly used to process log files
|
|
containing timestamp information, indicating when a
|
|
particular log record was written. Many programs log their timestamp
|
|
in the form returned by the @code{time} system call, which is the
|
|
number of seconds since a particular epoch. On POSIX-compliant systems,
|
|
it is the number of seconds since
|
|
1970-01-01 00:00:00 UTC, not counting leap seconds.@footnote{@xref{Glossary},
|
|
especially the entries ``Epoch'' and ``UTC.''}
|
|
All known POSIX-compliant systems support timestamps from 0 through
|
|
@math{2^{31} - 1}, which is sufficient to represent times through
|
|
2038-01-19 03:14:07 UTC. Many systems support a wider range of timestamps,
|
|
including negative timestamps that represent times before the
|
|
epoch.
|
|
|
|
@cindex @command{date} utility, GNU
|
|
@cindex time, retrieving
|
|
In order to make it easier to process such log files and to produce
|
|
useful reports, @command{gawk} provides the following functions for
|
|
working with timestamps. They are @command{gawk} extensions; they are
|
|
not specified in the POSIX standard, nor are they in any other known
|
|
version of @command{awk}.@footnote{The GNU @command{date} utility can
|
|
also do many of the things described here. Its use may be preferable
|
|
for simple time-related operations in shell scripts.}
|
|
Optional parameters are enclosed in square brackets ([ ]):
|
|
|
|
@table @code
|
|
@item systime()
|
|
@cindex @code{systime} function (@command{gawk})
|
|
@cindex timestamps
|
|
This function returns the current time as the number of seconds since
|
|
the system epoch. On POSIX systems, this is the number of seconds
|
|
since 1970-01-01 00:00:00 UTC, not counting leap seconds.
|
|
It may be a different number on
|
|
other systems.
|
|
|
|
@item mktime(@var{datespec})
|
|
@cindex @code{mktime} function (@command{gawk})
|
|
This function turns @var{datespec} into a timestamp in the same form
|
|
as is returned by @code{systime}. It is similar to the function of the
|
|
same name in ISO C. The argument, @var{datespec}, is a string of the form
|
|
@w{@code{"@var{YYYY} @var{MM} @var{DD} @var{HH} @var{MM} @var{SS} [@var{DST}]"}}.
|
|
The string consists of six or seven numbers representing, respectively,
|
|
the full year including century, the month from 1 to 12, the day of the month
|
|
from 1 to 31, the hour of the day from 0 to 23, the minute from 0 to
|
|
59, the second from 0 to 60,@footnote{Occasionally there are
|
|
minutes in a year with a leap second, which is why the
|
|
seconds can go up to 60.}
|
|
and an optional daylight-savings flag.
|
|
|
|
The values of these numbers need not be within the ranges specified;
|
|
for example, an hour of @minus{}1 means 1 hour before midnight.
|
|
The origin-zero Gregorian calendar is assumed, with year 0 preceding
|
|
year 1 and year @minus{}1 preceding year 0.
|
|
The time is assumed to be in the local timezone.
|
|
If the daylight-savings flag is positive, the time is assumed to be
|
|
daylight savings time; if zero, the time is assumed to be standard
|
|
time; and if negative (the default), @code{mktime} attempts to determine
|
|
whether daylight savings time is in effect for the specified time.
|
|
|
|
If @var{datespec} does not contain enough elements or if the resulting time
|
|
is out of range, @code{mktime} returns @minus{}1.
|
|
|
|
@item strftime(@r{[}@var{format} @r{[}, @var{timestamp}@r{]]})
|
|
@c STARTOFRANGE strf
|
|
@cindex @code{strftime} function (@command{gawk})
|
|
This function returns a string. It is similar to the function of the
|
|
same name in ISO C. The time specified by @var{timestamp} is used to
|
|
produce a string, based on the contents of the @var{format} string.
|
|
The @var{timestamp} is in the same format as the value returned by the
|
|
@code{systime} function. If no @var{timestamp} argument is supplied,
|
|
@command{gawk} uses the current time of day as the timestamp.
|
|
If no @var{format} argument is supplied, @code{strftime} uses
|
|
@code{@w{"%a %b %d %H:%M:%S %Z %Y"}}. This format string produces
|
|
output that is (almost) equivalent to that of the @command{date} utility.
|
|
(Versions of @command{gawk} prior to 3.0 require the @var{format} argument.)
|
|
@end table
|
|
|
|
The @code{systime} function allows you to compare a timestamp from a
|
|
log file with the current time of day. In particular, it is easy to
|
|
determine how long ago a particular record was logged. It also allows
|
|
you to produce log records using the ``seconds since the epoch'' format.
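
As a minimal sketch of the first use, the following rule reports the age of
each input record. It assumes, purely for illustration, that the first field
of every line is a ``seconds since the epoch'' timestamp; real log formats
vary, and the file name @file{age.awk} is hypothetical:

@example
# age.awk: report how long ago each record was logged.
# Assumption: $1 is a seconds-since-the-epoch timestamp.
BEGIN @{ now = systime() @}    # read the clock once, at startup

@{
    age = now - $1
    printf "%d seconds ago: %s\n", age, $0
@}
@end example

@noindent
Calling @code{systime} once in the @code{BEGIN} rule measures every record
against the same instant, so the reported ages are consistent within a run.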
|
|
|
|
@cindex converting, dates to timestamps
|
|
@cindex dates, converting to timestamps
|
|
@cindex timestamps, converting dates to
|
|
The @code{mktime} function allows you to convert a textual representation
|
|
of a date and time into a timestamp. This makes it easy to do before/after
|
|
comparisons of dates and times, particularly when dealing with date and
|
|
time data coming from an external source, such as a log file.
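
For instance, the following sketch converts a date written as
@samp{YYYY-MM-DD} and a time written as @samp{HH:MM:SS} into timestamps and
compares them. The function name @code{str2ts} and the input layout are
assumptions made for this example; they are not part of @command{gawk}:

@example
# str2ts: hypothetical helper that turns "YYYY-MM-DD" and "HH:MM:SS"
# into a timestamp by building a datespec string for mktime.
function str2ts(date, time,    spec)
@{
    spec = date " " time
    gsub(/[-:]/, " ", spec)    # "2000-09-14 08:30:00" -> "2000 09 14 08 30 00"
    return mktime(spec)        # -1 if spec is incomplete or out of range
@}

BEGIN @{
    t1 = str2ts("2000-09-14", "08:30:00")
    t2 = str2ts("2000-09-14", "17:45:00")
    if (t1 != -1 && t2 != -1 && t1 < t2)
        print "elapsed seconds:", t2 - t1
@}
@end example

@noindent
Because both values are plain numbers, the before/after test is an ordinary
numeric comparison. Checking for @minus{}1 first avoids comparing against a
failed conversion.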
|
|
|
|
The @code{strftime} function allows you to easily turn a timestamp
|
|
into human-readable information. It is similar in nature to the @code{sprintf}
|
|
function
|
|
(@pxref{String Functions}),
|
|
in that it copies nonformat specification characters verbatim to the
|
|
returned string, while substituting date and time values for format
|
|
specifications in the @var{format} string.
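
A short sketch of this behavior follows. The literal text in the format
string is arbitrary and was chosen only for illustration:

@example
BEGIN @{
    # Everything except the % specifications is copied through unchanged.
    print strftime("Report generated on %Y-%m-%d at %H:%M:%S", systime())
@}
@end example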
|
|
|
|
@cindex format specifiers, @code{strftime} function (@command{gawk})
|
|
@code{strftime} is guaranteed by the 1999 ISO C standard@footnote{As this
|
|
is a recent standard, not every system's @code{strftime} necessarily
|
|
supports all of the conversions listed here.}
|
|
to support the following date format specifications:
|
|
|
|
@table @code
|
|
@item %a
|
|
The locale's abbreviated weekday name.
|
|
|
|
@item %A
|
|
The locale's full weekday name.
|
|
|
|
@item %b
|
|
The locale's abbreviated month name.
|
|
|
|
@item %B
|
|
The locale's full month name.
|
|
|
|
@item %c
|
|
The locale's ``appropriate'' date and time representation.
|
|
(This is @samp{%a %b %e %T %Y} in the @code{"C"} locale.)
|
|
|
|
@item %C
|
|
The century. This is the year divided by 100 and truncated to the next
|
|
lower integer.
|
|
|
|
@item %d
|
|
The day of the month as a decimal number (01--31).
|
|
|
|
@item %D
|
|
Equivalent to specifying @samp{%m/%d/%y}.
|
|
|
|
@item %e
|
|
The day of the month, padded with a space if it is only one digit.
|
|
|
|
@item %F
|
|
Equivalent to specifying @samp{%Y-%m-%d}.
|
|
This is the ISO 8601 date format.
|
|
|
|
@item %g
|
|
The year modulo 100 of the ISO week number, as a decimal number (00--99).
|
|
For example, January 1, 1993 is in week 53 of 1992. Thus, the year
|
|
of its ISO week number is 1992, even though its year is 1993.
|
|
Similarly, December 31, 1973 is in week 1 of 1974. Thus, the year
|
|
of its ISO week number is 1974, even though its year is 1973.
|
|
|
|
@item %G
|
|
The full year of the ISO week number, as a decimal number.
|
|
|
|
@item %h
|
|
Equivalent to @samp{%b}.
|
|
|
|
@item %H
|
|
The hour (24-hour clock) as a decimal number (00--23).
|
|
|
|
@item %I
|
|
The hour (12-hour clock) as a decimal number (01--12).
|
|
|
|
@item %j
|
|
The day of the year as a decimal number (001--366).
|
|
|
|
@item %m
|
|
The month as a decimal number (01--12).
|
|
|
|
@item %M
|
|
The minute as a decimal number (00--59).
|
|
|
|
@item %n
|
|
A newline character (ASCII LF).
|
|
|
|
@item %p
|
|
The locale's equivalent of the AM/PM designations associated
|
|
with a 12-hour clock.
|
|
|
|
@item %r
|
|
The locale's 12-hour clock time.
|
|
(This is @samp{%I:%M:%S %p} in the @code{"C"} locale.)
|
|
|
|
@item %R
|
|
Equivalent to specifying @samp{%H:%M}.
|
|
|
|
@item %S
|
|
The second as a decimal number (00--60).
|
|
|
|
@item %t
|
|
A TAB character.
|
|
|
|
@item %T
|
|
Equivalent to specifying @samp{%H:%M:%S}.
|
|
|
|
@item %u
|
|
The weekday as a decimal number (1--7). Monday is day one.
|
|
|
|
@item %U
|
|
The week number of the year (the first Sunday as the first day of week one)
|
|
as a decimal number (00--53).
|
|
|
|
@c @cindex ISO 8601
|
|
@item %V
|
|
The week number of the year (with Monday as the first day of each week)
as a decimal number (01--53).
|
|
The method for determining the week number is as specified by ISO 8601.
|
|
(To wit: if the week containing January 1 has four or more days in the
|
|
new year, then it is week one; otherwise it is week 53 of the previous year
|
|
and the next week is week one.)
|
|
|
|
@item %w
|
|
The weekday as a decimal number (0--6). Sunday is day zero.
|
|
|
|
@item %W
|
|
The week number of the year (the first Monday as the first day of week one)
|
|
as a decimal number (00--53).
|
|
|
|
@item %x
|
|
The locale's ``appropriate'' date representation.
|
|
(This is @samp{%m/%d/%y} in the @code{"C"} locale.)
|
|
|
|
@item %X
|
|
The locale's ``appropriate'' time representation.
|
|
(This is @samp{%T} in the @code{"C"} locale.)
|
|
|
|
@item %y
|
|
The year modulo 100 as a decimal number (00--99).
|
|
|
|
@item %Y
|
|
The full year as a decimal number (e.g., 1995).
|
|
|
|
@c @cindex RFC 822
|
|
@c @cindex RFC 1036
|
|
@item %z
|
|
The timezone offset in a +HHMM format (e.g., the format necessary to
|
|
produce RFC 822/RFC 1036 date headers).
|
|
|
|
@item %Z
|
|
The time zone name or abbreviation; no characters if
|
|
no time zone is determinable.
|
|
|
|
@item %Ec %EC %Ex %EX %Ey %EY %Od %Oe %OH
|
|
@itemx %OI %Om %OM %OS %Ou %OU %OV %Ow %OW %Oy
|
|
``Alternate representations'' for the specifications
|
|
that use only the second letter (@samp{%c}, @samp{%C},
|
|
and so on).@footnote{If you don't understand any of this, don't worry about
|
|
it; these facilities are meant to make it easier to ``internationalize''
|
|
programs.
|
|
Other internationalization features are described in
|
|
@ref{Internationalization}.}
|
|
(These facilitate compliance with the POSIX @command{date} utility.)
|
|
|
|
@item %%
|
|
A literal @samp{%}.
|
|
@end table
|
|
|
|
If a conversion specifier is not one of the above, the behavior is
|
|
undefined.@footnote{This is because ISO C leaves the
|
|
behavior of the C version of @code{strftime} undefined and @command{gawk}
|
|
uses the system's version of @code{strftime} if it's there.
|
|
Typically, the conversion specifier either does not appear in the
|
|
returned string or appears literally.}
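
The difference between the calendar year and the year of the ISO week number
can be seen with a sketch such as the following. It assumes that the system's
@code{strftime} supports the C99 conversions @samp{%G}, @samp{%V}, and
@samp{%u}; on systems that do not, the output differs:

@example
$ gawk 'BEGIN @{
>     ts = mktime("1993 01 01 12 00 00")    # noon on January 1, 1993
>     print strftime("%Y-%m-%d is %G-W%V-%u", ts)
> @}'
@print{} 1993-01-01 is 1992-W53-5
@end example

@noindent
January 1, 1993 was a Friday (weekday 5), and it belongs to week 53 of 1992,
so @samp{%G} prints 1992 even though @samp{%Y} prints 1993.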
|
|
|
|
@c @cindex locale, definition of
|
|
Informally, a @dfn{locale} is the geographic place in which a program
|
|
is meant to run. For example, a common way to abbreviate the date
|
|
September 4, 1991 in the United States is ``9/4/91.''
|
|
In many countries in Europe, however, it is abbreviated ``4.9.91.''
|
|
Thus, the @samp{%x} specification in a @code{"US"} locale might produce
|
|
@samp{9/4/91}, while in a @code{"EUROPE"} locale, it might produce
|
|
@samp{4.9.91}. The ISO C standard defines a default @code{"C"}
|
|
locale, which is an environment that is typical of what most C programmers
|
|
are used to.
|
|
|
|
A public-domain C version of @code{strftime} is supplied with @command{gawk}
|
|
for systems that are not yet fully standards-compliant.
|
|
It supports all of the just listed format specifications.
|
|
If that version is
|
|
used to compile @command{gawk} (@pxref{Installation}),
|
|
then the following additional format specifications are available:
|
|
|
|
@table @code
|
|
@item %k
|
|
The hour (24-hour clock) as a decimal number (0--23).
|
|
Single-digit numbers are padded with a space.
|
|
|
|
@item %l
|
|
The hour (12-hour clock) as a decimal number (1--12).
|
|
Single-digit numbers are padded with a space.
|
|
|
|
@item %N
|
|
The ``Emperor/Era'' name.
|
|
Equivalent to @code{%C}.
|
|
|
|
@item %o
|
|
The ``Emperor/Era'' year.
|
|
Equivalent to @code{%y}.
|
|
|
|
@item %s
|
|
The time as a decimal timestamp in seconds since the epoch.
|
|
|
|
@item %v
|
|
The date in VMS format (e.g., @samp{20-JUN-1991}).
|
|
@end table
|
|
@c ENDOFRANGE strf
|
|
|
|
Additionally, the alternate representations are recognized but their
|
|
normal representations are used.
|
|
|
|
@cindex @code{date} utility, POSIX
|
|
@cindex POSIX @command{awk}, @code{date} utility and
|
|
This example is an @command{awk} implementation of the POSIX
|
|
@command{date} utility. Normally, the @command{date} utility prints the
|
|
current date and time of day in a well-known format. However, if you
|
|
provide an argument to it that begins with a @samp{+}, @command{date}
|
|
copies nonformat specifier characters to the standard output and
|
|
interprets the current time according to the format specifiers in
|
|
the string. For example:
|
|
|
|
@example
|
|
$ date '+Today is %A, %B %d, %Y.'
|
|
@print{} Today is Thursday, September 14, 2000.
|
|
@end example
|
|
|
|
Here is the @command{gawk} version of the @command{date} utility.
|
|
It has a shell ``wrapper'' to handle the @option{-u} option,
|
|
which requires that @command{date} run as if the time zone
|
|
is set to UTC:
|
|
|
|
@example
|
|
#! /bin/sh
#
# date --- approximate the P1003.2 'date' command

case $1 in
-u)  TZ=UTC0     # use UTC
     export TZ
     shift ;;
esac

@c FIXME: One day, change %d to %e, when C 99 is common.
gawk 'BEGIN  @{
    format = "%a %b %d %H:%M:%S %Z %Y"
    exitval = 0

    if (ARGC > 2)
        exitval = 1
    else if (ARGC == 2) @{
        format = ARGV[1]
        if (format ~ /^\+/)
            format = substr(format, 2)   # remove leading +
    @}
    print strftime(format)
    exit exitval
@}' "$@@"
|
|
@end example
|
|
@c ENDOFRANGE tst
|
|
@c ENDOFRANGE logftst
|
|
@c ENDOFRANGE filogtst
|
|
@c ENDOFRANGE gawtst
|
|
|
|
@node Bitwise Functions
|
|
@subsection Bit-Manipulation Functions of @command{gawk}
|
|
@c STARTOFRANGE bit
|
|
@cindex bitwise, operations
|
|
@c STARTOFRANGE and
|
|
@cindex AND bitwise operation
|
|
@c STARTOFRANGE oro
|
|
@cindex OR bitwise operation
|
|
@c STARTOFRANGE xor
|
|
@cindex XOR bitwise operation
|
|
@c STARTOFRANGE opbit
|
|
@cindex operations, bitwise
|
|
@quotation
|
|
@i{I can explain it for you, but I can't understand it for you.}@*
|
|
Anonymous
|
|
@end quotation
|
|
|
|
Many languages provide the ability to perform @dfn{bitwise} operations
|
|
on two integer numbers. In other words, the operation is performed on
|
|
each successive pair of bits in the operands.
|
|
Three common operations are bitwise AND, OR, and XOR.
|
|
The operations are described by the following table:
|
|
|
|
@ifnottex
|
|
@display
|
|
                Bit Operator
          |  AND  |   OR  |  XOR
          |---+---+---+---+---+---
Operands  | 0 | 1 | 0 | 1 | 0 | 1
----------+---+---+---+---+---+---
    0     | 0   0 | 0   1 | 0   1
    1     | 0   1 | 1   1 | 1   0
|
|
@end display
|
|
@end ifnottex
|
|
@tex
|
|
\centerline{
|
|
\vbox{\bigskip % space above the table (about 1 linespace)
|
|
% Because we have vertical rules, we can't let TeX insert interline space
|
|
% in its usual way.
|
|
\offinterlineskip
|
|
\halign{\strut\hfil#\quad\hfil % operands
|
|
&\vrule#&\quad#\quad % rule, 0 (of and)
|
|
&\vrule#&\quad#\quad % rule, 1 (of and)
|
|
&\vrule# % rule between and and or
|
|
&\quad#\quad % 0 (of or)
|
|
&\vrule#&\quad#\quad	% rule, 1 (of or)
|
|
&\vrule# % rule between or and xor
|
|
&\quad#\quad % 0 of xor
|
|
&\vrule#&\quad#\quad % rule, 1 of xor
|
|
\cr
|
|
&\omit&\multispan{11}\hfil\bf Bit operator\hfil\cr
|
|
\noalign{\smallskip}
|
|
& &\multispan3\hfil AND\hfil&&\multispan3\hfil OR\hfil
|
|
&&\multispan3\hfil XOR\hfil\cr
|
|
\bf Operands&&0&&1&&0&&1&&0&&1\cr
|
|
\noalign{\hrule}
|
|
\omit&height 2pt&&\omit&&&&\omit&&&&\omit\cr
|
|
\noalign{\hrule height0pt}% without this the rule does not extend; why?
|
|
0&&0&\omit&0&&0&\omit&1&&0&\omit&1\cr
|
|
1&&0&\omit&1&&1&\omit&1&&1&\omit&0\cr
|
|
}}}
|
|
@end tex
|
|
|
|
@cindex bitwise, complement
|
|
@cindex complement, bitwise
|
|
As you can see, the result of an AND operation is 1 only when @emph{both}
|
|
bits are 1.
|
|
The result of an OR operation is 1 if @emph{either} bit is 1.
|
|
The result of an XOR operation is 1 if either bit is 1,
|
|
but not both.
|
|
The next operation is the @dfn{complement}; the complement of 1 is 0 and
|
|
the complement of 0 is 1. Thus, this operation ``flips'' all the bits
|
|
of a given value.
|
|
|
|
@cindex bitwise, shift
|
|
@cindex left shift, bitwise
|
|
@cindex right shift, bitwise
|
|
@cindex shift, bitwise
|
|
Finally, two other common operations are to shift the bits left or right.
|
|
For example, if you have a bit string @samp{10111001} and you shift it
|
|
right by three bits, you end up with @samp{00010111}.@footnote{This example
|
|
shows that 0's come in on the left side. For @command{gawk}, this is
|
|
always true, but in some languages, it's possible to have the left side
|
|
fill with 1's. Caveat emptor.}
|
|
@c Purposely decided to use 0's and 1's here. 2/2001.
|
|
If you start over
|
|
again with @samp{10111001} and shift it left by three bits, you end up
|
|
with @samp{11001000}.
|
|
@command{gawk} provides built-in functions that implement the
|
|
bitwise operations just described. They are:
|
|
|
|
@ignore
|
|
@table @code
|
|
@cindex @code{and} function (@command{gawk})
|
|
@item and(@var{v1}, @var{v2})
|
|
Return the bitwise AND of the values provided by @var{v1} and @var{v2}.
|
|
|
|
@cindex @code{or} function (@command{gawk})
|
|
@item or(@var{v1}, @var{v2})
|
|
Return the bitwise OR of the values provided by @var{v1} and @var{v2}.
|
|
|
|
@cindex @code{xor} function (@command{gawk})
|
|
@item xor(@var{v1}, @var{v2})
|
|
Return the bitwise XOR of the values provided by @var{v1} and @var{v2}.
|
|
|
|
@cindex @code{compl} function (@command{gawk})
|
|
@item compl(@var{val})
|
|
Return the bitwise complement of @var{val}.
|
|
|
|
@cindex @code{lshift} function (@command{gawk})
|
|
@item lshift(@var{val}, @var{count})
|
|
Return the value of @var{val}, shifted left by @var{count} bits.
|
|
|
|
@cindex @code{rshift} function (@command{gawk})
|
|
@item rshift(@var{val}, @var{count})
|
|
Return the value of @var{val}, shifted right by @var{count} bits.
|
|
@end table
|
|
@end ignore
|
|
|
|
@cindex @command{gawk}, bitwise operations in
|
|
@multitable {@code{rshift(@var{val}, @var{count})}} {Return the value of @var{val}, shifted right by @var{count} bits.}
|
|
@cindex @code{and} function (@command{gawk})
|
|
@item @code{and(@var{v1}, @var{v2})}
|
|
@tab Returns the bitwise AND of the values provided by @var{v1} and @var{v2}.
|
|
|
|
@cindex @code{or} function (@command{gawk})
|
|
@item @code{or(@var{v1}, @var{v2})}
|
|
@tab Returns the bitwise OR of the values provided by @var{v1} and @var{v2}.
|
|
|
|
@cindex @code{xor} function (@command{gawk})
|
|
@item @code{xor(@var{v1}, @var{v2})}
|
|
@tab Returns the bitwise XOR of the values provided by @var{v1} and @var{v2}.
|
|
|
|
@cindex @code{compl} function (@command{gawk})
|
|
@item @code{compl(@var{val})}
|
|
@tab Returns the bitwise complement of @var{val}.
|
|
|
|
@cindex @code{lshift} function (@command{gawk})
|
|
@item @code{lshift(@var{val}, @var{count})}
|
|
@tab Returns the value of @var{val}, shifted left by @var{count} bits.
|
|
|
|
@cindex @code{rshift} function (@command{gawk})
|
|
@item @code{rshift(@var{val}, @var{count})}
|
|
@tab Returns the value of @var{val}, shifted right by @var{count} bits.
|
|
@end multitable
|
|
|
|
For all of these functions, first the double-precision floating-point value is
|
|
converted to the widest C unsigned integer type, then the bitwise operation is
|
|
performed and then the result is converted back into a C @code{double}. (If
|
|
you don't understand this paragraph, don't worry about it.)
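
The following sketch exercises the functions on small values so that the
results can be checked by hand against the tables above. The operand values
are arbitrary; @samp{0xB9} is the bit string @samp{10111001} from the shift
discussion:

@example
BEGIN @{
    printf "and(12, 10) = %d\n", and(12, 10)    # 1100 AND 1010 is 1000
    printf "or(12, 10)  = %d\n", or(12, 10)     # 1100 OR  1010 is 1110
    printf "xor(12, 10) = %d\n", xor(12, 10)    # 1100 XOR 1010 is 0110
    printf "rshift(0xB9, 3) = %#x\n", rshift(0xB9, 3)
    # lshift keeps the bits shifted past position eight,
    # so mask with 0xFF to see only the low eight bits
    printf "and(lshift(0xB9, 3), 0xFF) = %#x\n", and(lshift(0xB9, 3), 0xFF)
@}
@end example

@noindent
When run, this prints:

@example
@print{} and(12, 10) = 8
@print{} or(12, 10)  = 14
@print{} xor(12, 10) = 6
@print{} rshift(0xB9, 3) = 0x17
@print{} and(lshift(0xB9, 3), 0xFF) = 0xc8
@end example

@noindent
Note that @code{lshift} does not truncate its result to eight bits; the
explicit mask is what reproduces the @samp{11001000} pattern shown earlier.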
|
|
|
|
Here is a user-defined function
|
|
(@pxref{User-defined})
|
|
that illustrates the use of these functions:
|
|
|
|
@cindex @code{bits2str} user-defined function
|
|
@cindex @code{testbits.awk} program
|
|
@smallexample
|
|
@group
|
|
@c file eg/lib/bits2str.awk
|
|
# bits2str --- turn a byte into readable 1's and 0's

function bits2str(bits,        data, mask)
@{
    if (bits == 0)
        return "0"

    mask = 1
    for (; bits != 0; bits = rshift(bits, 1))
        data = (and(bits, mask) ? "1" : "0") data

    while ((length(data) % 8) != 0)
        data = "0" data

    return data
@}
|
|
@c endfile
|
|
@end group
|
|
|
|
@c this is a hack to make testbits.awk self-contained
|
|
@ignore
|
|
@c file eg/prog/testbits.awk
|
|
# bits2str --- turn a byte into readable 1's and 0's

function bits2str(bits,        data, mask)
@{
    if (bits == 0)
        return "0"

    mask = 1
    for (; bits != 0; bits = rshift(bits, 1))
        data = (and(bits, mask) ? "1" : "0") data

    while ((length(data) % 8) != 0)
        data = "0" data

    return data
@}
|
|
@c endfile
|
|
@end ignore
|
|
@c file eg/prog/testbits.awk
|
|
BEGIN @{
    printf "123 = %s\n", bits2str(123)
    printf "0123 = %s\n", bits2str(0123)
    printf "0x99 = %s\n", bits2str(0x99)
    comp = compl(0x99)
    printf "compl(0x99) = %#x = %s\n", comp, bits2str(comp)
    shift = lshift(0x99, 2)
    printf "lshift(0x99, 2) = %#x = %s\n", shift, bits2str(shift)
    shift = rshift(0x99, 2)
    printf "rshift(0x99, 2) = %#x = %s\n", shift, bits2str(shift)
@}
|
|
@c endfile
|
|
@end smallexample
|
|
|
|
@noindent
|
|
This program produces the following output when run:
|
|
|
|
@smallexample
|
|
$ gawk -f testbits.awk
|
|
@print{} 123 = 01111011
|
|
@print{} 0123 = 01010011
|
|
@print{} 0x99 = 10011001
|
|
@print{} compl(0x99) = 0xffffff66 = 11111111111111111111111101100110
|
|
@print{} lshift(0x99, 2) = 0x264 = 0000001001100100
|
|
@print{} rshift(0x99, 2) = 0x26 = 00100110
|
|
@end smallexample
|
|
|
|
@cindex numbers, converting, to strings
|
|
@cindex strings, converting, numbers to
|
|
@cindex converting, numbers, to strings
|
|
The @code{bits2str} function turns a binary number into a string.
|
|
The number @code{1} represents a binary value where the rightmost bit
|
|
is set to 1. Using this mask,
|
|
the function repeatedly checks the rightmost bit.
|
|
ANDing the mask with the value indicates whether the
|
|
rightmost bit is 1 or not. If so, a @code{"1"} is concatenated onto the front
|
|
of the string.
|
|
Otherwise, a @code{"0"} is added.
|
|
The value is then shifted right by one bit and the loop continues
|
|
until there are no more 1 bits.
|
|
|
|
If the initial value is zero, the function returns a simple @code{"0"}.
|
|
Otherwise, at the end, it pads the value with zeros to represent multiples
|
|
of 8-bit quantities. This is typical in modern computers.
|
|
|
|
The main code in the @code{BEGIN} rule shows the difference between the
|
|
decimal and octal values for the same numbers
|
|
(@pxref{Nondecimal-numbers}),
|
|
and then demonstrates the
|
|
results of the @code{compl}, @code{lshift}, and @code{rshift} functions.
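
A common application of these functions is packing several yes/no flags into
a single number. The sketch below is illustrative only; the flag names and
their values are invented for the example:

@example
BEGIN @{
    READ = 4; WRITE = 2; EXEC = 1    # hypothetical flag bits: 100, 010, 001

    perms = or(READ, WRITE)          # turn on two flags, giving 110
    if (and(perms, WRITE))           # nonzero when the WRITE bit is set
        print "writable"

    perms = xor(perms, WRITE)        # flip WRITE back off, giving 100
    print perms
@}
@end example

@noindent
This prints @samp{writable} followed by @samp{4}. The sketch clears the flag
with @code{xor} on a bit known to be set, rather than with @code{compl},
to keep every intermediate value small.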
|
|
@c ENDOFRANGE bit
|
|
@c ENDOFRANGE and
|
|
@c ENDOFRANGE oro
|
|
@c ENDOFRANGE xor
|
|
@c ENDOFRANGE opbit
|
|
|
|
@node I18N Functions
|
|
@subsection Using @command{gawk}'s String-Translation Functions
|
|
@cindex @command{gawk}, string-translation functions
|
|
@cindex functions, string-translation
|
|
@cindex internationalization
|
|
@cindex @command{awk} programs, internationalizing
|
|
|
|
@command{gawk} provides facilities for internationalizing @command{awk} programs.
|
|
These include the functions described in the following list.
|
|
The descriptions here are purposely brief.
|
|
@xref{Internationalization},
|
|
for the full story.
|
|
Optional parameters are enclosed in square brackets ([ ]):
|
|
|
|
@table @code
|
|
@cindex @code{dcgettext} function (@command{gawk})
|
|
@item dcgettext(@var{string} @r{[}, @var{domain} @r{[}, @var{category}@r{]]})
|
|
This function returns the translation of @var{string} in
|
|
text domain @var{domain} for locale category @var{category}.
|
|
The default value for @var{domain} is the current value of @code{TEXTDOMAIN}.
|
|
The default value for @var{category} is @code{"LC_MESSAGES"}.
|
|
|
|
@cindex @code{dcngettext} function (@command{gawk})
|
|
@item dcngettext(@var{string1}, @var{string2}, @var{number} @r{[}, @var{domain} @r{[}, @var{category}@r{]]})
|
|
This function returns the plural form used for @var{number} of the
|
|
translation of @var{string1} and @var{string2} in text domain
|
|
@var{domain} for locale category @var{category}. @var{string1} is the
|
|
English singular variant of a message, and @var{string2} the English plural
|
|
variant of the same message.
|
|
The default value for @var{domain} is the current value of @code{TEXTDOMAIN}.
|
|
The default value for @var{category} is @code{"LC_MESSAGES"}.
|
|
|
|
@cindex @code{bindtextdomain} function (@command{gawk})
|
|
@item bindtextdomain(@var{directory} @r{[}, @var{domain}@r{]})
|
|
This function allows you to specify the directory in which
|
|
@command{gawk} will look for message translation files, in case they
|
|
will not or cannot be placed in the ``standard'' locations
|
|
(e.g., during testing).
|
|
It returns the directory in which @var{domain} is ``bound.''
|
|
|
|
The default @var{domain} is the value of @code{TEXTDOMAIN}.
|
|
If @var{directory} is the null string (@code{""}), then
|
|
@code{bindtextdomain} returns the current binding for the
|
|
given @var{domain}.
|
|
@end table
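
As a brief sketch of how these functions fit together, consider the following
program. The text domain @code{"guide"}, the directory, and the messages are
placeholders chosen for this example. If no translation is installed for the
current locale, the functions simply return the original strings:

@example
BEGIN @{
    TEXTDOMAIN = "guide"              # placeholder text domain
    bindtextdomain(".")               # look for translations under "."

    print dcgettext("Don't Panic")    # domain and category use defaults

    n = 3
    # choose the singular or plural message according to n
    printf dcngettext("%d file found\n", "%d files found\n", n), n
@}
@end example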
|
|
@c ENDOFRANGE funcbi
|
|
@c ENDOFRANGE bifunc
|
|
|
|
@node User-defined
|
|
@section User-Defined Functions
|
|
|
|
@c STARTOFRANGE udfunc
|
|
@cindex user-defined, functions
|
|
@c STARTOFRANGE funcud
|
|
@cindex functions, user-defined
|
|
Complicated @command{awk} programs can often be simplified by defining
|
|
your own functions. User-defined functions can be called just like
|
|
built-in ones (@pxref{Function Calls}), but it is up to you to define
|
|
them, i.e., to tell @command{awk} what they should do.
|
|
|
|
@menu
|
|
* Definition Syntax:: How to write definitions and what they mean.
|
|
* Function Example:: An example function definition and what it
|
|
does.
|
|
* Function Caveats:: Things to watch out for.
|
|
* Return Statement:: Specifying the value a function returns.
|
|
* Dynamic Typing:: How variable types can change at runtime.
|
|
@end menu
|
|
|
|
@node Definition Syntax
|
|
@subsection Function Definition Syntax
|
|
|
|
@c STARTOFRANGE fdef
|
|
@cindex functions, defining
|
|
Definitions of functions can appear anywhere between the rules of an
|
|
@command{awk} program. Thus, the general form of an @command{awk} program is
|
|
extended to include sequences of rules @emph{and} user-defined function
|
|
definitions.
|
|
There is no need to put the definition of a function
|
|
before all uses of the function. This is because @command{awk} reads the
|
|
entire program before starting to execute any of it.
|
|
|
|
The definition of a function named @var{name} looks like this:
|
|
@c NEXT ED: put [ ] around parameter list
|
|
|
|
@example
|
|
function @var{name}(@var{parameter-list})
@{
     @var{body-of-function}
@}
|
|
@end example
|
|
|
|
@cindex names, functions
|
|
@cindex functions, names of
|
|
@cindex namespace issues, functions
|
|
@noindent
|
|
@var{name} is the name of the function to define. A valid function
|
|
name is like a valid variable name: a sequence of letters, digits, and
|
|
underscores that doesn't start with a digit.
|
|
Within a single @command{awk} program, any particular name can only be
|
|
used as a variable, array, or function.
|
|
|
|
@c NEXT ED: parameter-list is an OPTIONAL list of ...
|
|
@var{parameter-list} is a list of the function's arguments and local
|
|
variable names, separated by commas. When the function is called,
|
|
the argument names are used to hold the argument values given in
|
|
the call. The local variables are initialized to the empty string.
|
|
A function cannot have two parameters with the same name, nor may it
|
|
have a parameter with the same name as the function itself.
|
|
|
|
The @var{body-of-function} consists of @command{awk} statements. It is the
|
|
most important part of the definition, because it says what the function
|
|
should actually @emph{do}. The argument names exist to give the body a
|
|
way to talk about the arguments; local variables exist to give the body
|
|
places to keep temporary values.
|
|
|
|
Argument names are not distinguished syntactically from local variable
|
|
names. Instead, the number of arguments supplied when the function is
|
|
called determines how many argument variables there are. Thus, if three
|
|
argument values are given, the first three names in @var{parameter-list}
|
|
are arguments and the rest are local variables.
|
|
|
|
It follows that if the number of arguments is not the same in all calls
|
|
to the function, some of the names in @var{parameter-list} may be
|
|
arguments on some occasions and local variables on others. Another
|
|
way to think of this is that omitted arguments default to the
|
|
null string.
|
|
|
|
@cindex programming conventions, functions, writing
|
|
Usually when you write a function, you know how many names you intend to
|
|
use for arguments and how many you intend to use as local variables. It is
|
|
conventional to place some extra space between the arguments and
|
|
the local variables, in order to document how your function is supposed to be used.
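
The convention can be sketched as follows. The function @code{sum_to} is a
made-up example, not a library function; @code{n} is its only intended
argument, while @code{i} and @code{total} are local variables:

@example
# sum_to: add the integers from 1 to n (hypothetical example)
function sum_to(n,    i, total)
@{
    for (i = 1; i <= n; i++)
        total += i          # total starts out as the empty string
    return total + 0        # force a number, even if the loop never ran
@}

BEGIN @{ print sum_to(4) @}   # prints 10
@end example

@noindent
The caller passes only one value, so @code{i} and @code{total} begin empty
on every call, no matter what happened during earlier calls.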
|
|
|
|
@cindex variables, shadowing
|
|
During execution of the function body, the arguments and local variable
|
|
values hide, or @dfn{shadow}, any variables of the same names used in the
|
|
rest of the program. The shadowed variables are not accessible in the
|
|
function definition, because there is no way to name them while their
|
|
names have been taken away for the local variables. All other variables
|
|
used in the @command{awk} program can be referenced or set normally in the
|
|
function's body.
|
|
|
|
The arguments and local variables last only as long as the function body
|
|
is executing. Once the body finishes, you can once again access the
|
|
variables that were shadowed while the function was running.
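
A small sketch makes the effect visible. The names @code{show} and @code{x}
are arbitrary and used only for illustration:

@example
function show(x)
@{
    x = "local"             # changes only the parameter x
    print "inside:", x
@}

BEGIN @{
    x = "global"
    show("arg")
    print "outside:", x     # the global x was never touched
@}
@end example

@noindent
This prints @samp{inside: local} and then @samp{outside: global}. While
@code{show} runs, the name @code{x} refers to the parameter, and the global
variable becomes visible again once the function returns.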
|
|
|
|
@cindex recursive functions
|
|
@cindex functions, recursive
|
|
The function body can contain expressions that call functions. They
|
|
can even call this function, either directly or by way of another
|
|
function. When this happens, we say the function is @dfn{recursive}.
|
|
The act of a function calling itself is called @dfn{recursion}.
|
|
|
|
@c @cindex @command{awk} language, POSIX version
|
|
@c @cindex POSIX @command{awk}
|
|
@cindex POSIX @command{awk}, @code{function} keyword in
|
|
In many @command{awk} implementations, including @command{gawk},
|
|
the keyword @code{function} may be
|
|
abbreviated @code{func}. However, POSIX only specifies the use of
|
|
the keyword @code{function}. This actually has some practical implications.
|
|
If @command{gawk} is in POSIX-compatibility mode
|
|
(@pxref{Options}), then the following
|
|
statement does @emph{not} define a function:
|
|
|
|
@example
|
|
func foo() @{ a = sqrt($1) ; print a @}
|
|
@end example
|
|
|
|
@noindent
|
|
Instead it defines a rule that, for each record, concatenates the value
|
|
of the variable @samp{func} with the return value of the function @samp{foo}.
|
|
If the resulting string is non-null, the action is executed.
|
|
This is probably not what is desired. (@command{awk} accepts this input as
|
|
syntactically valid, because functions may be used before they are defined
|
|
in @command{awk} programs.)
|
|
@c NEXT ED: This won't actually run, since foo() is undefined ...
|
|
|
|
@c last comma does NOT start tertiary
|
|
@cindex portability, functions, defining
|
|
To ensure that your @command{awk} programs are portable, always use the
|
|
keyword @code{function} when defining a function.
|
|
|
|
@node Function Example
|
|
@subsection Function Definition Examples
|
|
|
|
Here is an example of a user-defined function, called @code{myprint}, that
|
|
takes a number and prints it in a specific format:
|
|
|
|
@example
|
|
function myprint(num)
@{
     printf "%6.3g\n", num
@}
|
|
@end example
|
|
|
|
@noindent
|
|
To illustrate, here is an @command{awk} rule that uses our @code{myprint}
|
|
function:
|
|
|
|
@example
|
|
$3 > 0 @{ myprint($3) @}
|
|
@end example
|
|
|
|
@noindent
|
|
This program prints, in our special format, all the third fields that
|
|
contain a positive number in our input. Therefore, when given the following:
|
|
|
|
@example
|
|
1.2 3.4 5.6 7.8
|
|
9.10 11.12 -13.14 15.16
|
|
17.18 19.20 21.22 23.24
|
|
@end example
|
|
|
|
@noindent
|
|
this program, using our function to format the results, prints:
|
|
|
|
@example
|
|
5.6
|
|
21.2
|
|
@end example
|
|
|
|
This function deletes all the elements in an array:
|
|
|
|
@example
|
|
function delarray(a,    i)
@{
    for (i in a)
       delete a[i]
@}
|
|
@end example
|
|
|
|
When working with arrays, it is often necessary to delete all the elements
|
|
in an array and start over with a new list of elements
|
|
(@pxref{Delete}).
|
|
Instead of having
|
|
to repeat this loop everywhere that you need to clear out
|
|
an array, your program can just call @code{delarray}.
|
|
(This guarantees portability. The use of @samp{delete @var{array}} to delete
|
|
the contents of an entire array is a nonstandard extension.)
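
As a usage sketch, the following program fills an array, clears it with
@code{delarray}, and then counts what is left. The array name
@code{letters} and its contents are arbitrary; the function is repeated here
only so that the example is self-contained:

@example
function delarray(a,    i)
@{
    for (i in a)
       delete a[i]
@}

BEGIN @{
    split("a b c", letters)     # letters now has three elements
    delarray(letters)

    count = 0
    for (k in letters)          # loop body never runs on an empty array
        count++
    print count                 # prints 0
@}
@end example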
|
|
|
|
The following is an example of a recursive function. It takes a string
|
|
as an input parameter and returns the string in backwards order.
|
|
Recursive functions must always have a test that stops the recursion.
|
|
In this case, the recursion terminates when the starting position
|
|
is zero, i.e., when there are no more characters left in the string.
|
|
|
|
@cindex @code{rev} user-defined function
|
|
@example
|
|
function rev(str, start)
@{
    if (start == 0)
        return ""

    return (substr(str, start, 1) rev(str, start - 1))
@}
|
|
@end example
|
|
|
|
If this function is in a file named @file{rev.awk}, it can be tested
|
|
this way:
|
|
|
|
@example
|
|
$ echo "Don't Panic!" |
|
|
> gawk --source '@{ print rev($0, length($0)) @}' -f rev.awk
|
|
@print{} !cinaP t'noD
|
|
@end example
|
|
|
|
The C @code{ctime} function takes a timestamp and returns it in a string,
|
|
formatted in a well-known fashion.
|
|
The following example uses the built-in @code{strftime} function
|
|
(@pxref{Time Functions})
|
|
to create an @command{awk} version of @code{ctime}:
|
|
|
|
@cindex @code{ctime} user-defined function
|
|
@c FIXME: One day, change %d to %e, when C 99 is common.
|
|
@example
|
|
@c file eg/lib/ctime.awk
|
|
# ctime.awk
#
# awk version of C ctime(3) function

function ctime(ts,    format)
@{
    format = "%a %b %d %H:%M:%S %Z %Y"
    if (ts == 0)
        ts = systime()       # use current time as default
    return strftime(format, ts)
@}
|
|
@c endfile
|
|
@end example
|
|
@c ENDOFRANGE fdef
|
|
|
|
@node Function Caveats
|
|
@subsection Calling User-Defined Functions
|
|
|
|
@c STARTOFRANGE fudc
|
|
@cindex functions, user-defined, calling
|
|
@dfn{Calling a function} means causing the function to run and do its job.
|
|
A function call is an expression and its value is the value returned by
|
|
the function.
|
|
|
|
A function call consists of the function name followed by the arguments
|
|
in parentheses. You write the arguments in the call as @command{awk}
expressions. Each time the call is executed, these
|
|
expressions are evaluated, and the values are the actual arguments. For
|
|
example, here is a call to @code{foo} with three arguments (the first
|
|
being a string concatenation):
|
|
|
|
@example
|
|
foo(x y, "lose", 4 * z)
|
|
@end example
|
|
|
|
@strong{Caution:} Whitespace characters (spaces and tabs) are not allowed
|
|
between the function name and the open-parenthesis of the argument list.
|
|
If you write whitespace by mistake, @command{awk} might think that you mean
|
|
to concatenate a variable with an expression in parentheses. However, it
|
|
notices that you used a function name and not a variable name, and reports
|
|
an error.
|
|
|
|
@cindex call by value
|
|
When a function is called, it is given a @emph{copy} of the values of
|
|
its arguments. This is known as @dfn{call by value}. The caller may use
|
|
a variable as the expression for the argument, but the called function
|
|
does not know this---it only knows what value the argument had. For
|
|
example, if you write the following code:
|
|
|
|
@example
|
|
foo = "bar"
|
|
z = myfunc(foo)
|
|
@end example
|
|
|
|
@noindent
|
|
then you should not think of the argument to @code{myfunc} as being
|
|
``the variable @code{foo}.'' Instead, think of the argument as the
|
|
string value @code{"bar"}.
|
|
If the function @code{myfunc} alters the values of its local variables,
|
|
this has no effect on any other variables. Thus, if @code{myfunc}
|
|
does this:
|
|
|
|
@example
|
|
function myfunc(str)
@{
    print str
    str = "zzz"
    print str
@}
|
|
@end example
|
|
|
|
@noindent
|
|
to change its first argument variable @code{str}, it does @emph{not}
|
|
change the value of @code{foo} in the caller. The role of @code{foo} in
|
|
calling @code{myfunc} ended when its value (@code{"bar"}) was computed.
|
|
If @code{str} also exists outside of @code{myfunc}, the function body
|
|
cannot alter this outer value, because it is shadowed during the
|
|
execution of @code{myfunc} and cannot be seen or changed from there.
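
A complete program makes the point concrete. This is a minimal sketch that
reuses the names from the fragments above, with the @code{print} statement
moved to the caller:

@example
function myfunc(str)
@{
    str = "zzz"      # changes only the local copy
@}

BEGIN @{
    foo = "bar"
    myfunc(foo)
    print foo        # foo is unchanged
@}
@end example

@noindent
The program prints @samp{bar}.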
|
|
|
|
@cindex call by reference
|
|
@cindex arrays, as parameters to functions
|
|
@cindex functions, arrays as parameters to
|
|
However, when arrays are the parameters to functions, they are @emph{not}
|
|
copied. Instead, the array itself is made available for direct manipulation
|
|
by the function. This is usually called @dfn{call by reference}.
|
|
Changes made to an array parameter inside the body of a function @emph{are}
|
|
visible outside that function.
|
|
|
|
@strong{Note:} Changing an array parameter inside a function
|
|
can be very dangerous if you do not watch what you are doing.
|
|
For example:
|
|
|
|
@example
|
|
function changeit(array, ind, nvalue)
@{
    array[ind] = nvalue
@}

BEGIN @{
    a[1] = 1; a[2] = 2; a[3] = 3
    changeit(a, 2, "two")
    printf "a[1] = %s, a[2] = %s, a[3] = %s\n",
            a[1], a[2], a[3]
@}
|
|
@end example
|
|
|
|
@noindent
|
|
prints @samp{a[1] = 1, a[2] = two, a[3] = 3}, because
|
|
@code{changeit} stores @code{"two"} in the second element of @code{a}.
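
Deletions propagate in the same way. In the following sketch, the function
name @code{clear} is hypothetical and chosen only for illustration:

@example
function clear(array,    i)
@{
    for (i in array)
        delete array[i]
@}

BEGIN @{
    a[1] = 1; a[2] = 2; a[3] = 3
    clear(a)
    n = 0
    for (i in a)
        n++
    print n
@}
@end example

@noindent
This prints @samp{0}: after the call, the caller's array @code{a} has no
elements left.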
|
|
|
|
@cindex undefined functions
|
|
@cindex functions, undefined
|
|
Some @command{awk} implementations allow you to call a function that
|
|
has not been defined. They only report a problem at runtime when the
|
|
program actually tries to call the function. For example:
|
|
|
|
@example
|
|
BEGIN @{
    if (0)
        foo()
    else
        bar()
@}
function bar() @{ @dots{} @}
# note that `foo' is not defined
|
|
@end example
|
|
|
|
@noindent
|
|
Because the @samp{if} statement will never be true, it is not really a
|
|
problem that @code{foo} has not been defined. Usually, though, it is a
|
|
problem if a program calls an undefined function.
|
|
|
|
@cindex lint checking, undefined functions
|
|
If @option{--lint} is specified
|
|
(@pxref{Options}),
|
|
@command{gawk} reports calls to undefined functions.
|
|
|
|
@cindex portability, @code{next} statement in user-defined functions
|
|
Some @command{awk} implementations generate a runtime
|
|
error if you use the @code{next} statement
|
|
(@pxref{Next Statement})
|
|
inside a user-defined function.
|
|
@command{gawk} does not have this limitation.
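
For example, a helper function might skip comment lines on behalf of the
main rule. This is a sketch only; the name @code{skip_comments} is
hypothetical, and the program relies on @command{gawk}'s behavior, so it
may fail at runtime in the other implementations mentioned above:

@example
function skip_comments()
@{
    if ($0 ~ /^#/)
        next        # abandon this record; read the next one
@}

@{
    skip_comments()
    print
@}
@end example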
|
|
@c ENDOFRANGE fudc
|
|
|
|
@node Return Statement
|
|
@subsection The @code{return} Statement
|
|
@c comma does NOT start a secondary
|
|
@cindex @code{return} statement, user-defined functions
|
|
|
|
The body of a user-defined function can contain a @code{return} statement.
|
|
This statement returns control to the calling part of the @command{awk} program. It
|
|
can also be used to return a value for use in the rest of the @command{awk}
|
|
program. It looks like this:
|
|
|
|
@example
|
|
return @r{[}@var{expression}@r{]}
|
|
@end example
|
|
|
|
The @var{expression} part is optional. If it is omitted, then the returned
|
|
value is undefined, and therefore, unpredictable.
|
|
|
|
A @code{return} statement with no value expression is assumed at the end of
|
|
every function definition. So if control reaches the end of the function
|
|
body, then the function returns an unpredictable value. @command{awk}
|
|
does @emph{not} warn you if you use the return value of such a function.
|
|
|
|
Sometimes, you want to write a function for what it does, not for
|
|
what it returns. Such a function corresponds to a @code{void} function
|
|
in C or to a @code{procedure} in Pascal. Thus, it may be appropriate to not
|
|
return any value; simply bear in mind that if you use the return
|
|
value of such a function, you do so at your own risk.
|
|
|
|
The following is an example of a user-defined function that returns a value
|
|
for the largest number among the elements of an array:
|
|
|
|
@example
|
|
function maxelt(vec,   i, ret)
@{
     for (i in vec) @{
          if (ret == "" || vec[i] > ret)
               ret = vec[i]
     @}
     return ret
@}
|
|
@end example
|
|
|
|
@cindex programming conventions, function parameters
|
|
@noindent
|
|
You call @code{maxelt} with one argument, which is an array name. The local
|
|
variables @code{i} and @code{ret} are not intended to be arguments;
|
|
while there is nothing to stop you from passing more than one argument
|
|
to @code{maxelt}, the results would be strange. The extra space before
|
|
@code{i} in the function parameter list indicates that @code{i} and
|
|
@code{ret} are not supposed to be arguments.
|
|
You should follow this convention when defining functions.
|
|
|
|
The following program uses the @code{maxelt} function. It loads an
|
|
array, calls @code{maxelt}, and then reports the maximum number in that
|
|
array:
|
|
|
|
@example
|
|
function maxelt(vec,   i, ret)
@{
     for (i in vec) @{
          if (ret == "" || vec[i] > ret)
               ret = vec[i]
     @}
     return ret
@}

# Load all fields of each record into nums.
@{
     for(i = 1; i <= NF; i++)
          nums[NR, i] = $i
@}

END @{
     print maxelt(nums)
@}
|
|
@end example
|
|
|
|
Given the following input:
|
|
|
|
@example
|
|
1 5 23 8 16
|
|
44 3 5 2 8 26
|
|
256 291 1396 2962 100
|
|
-6 467 998 1101
|
|
99385 11 0 225
|
|
@end example
|
|
|
|
@noindent
|
|
the program reports (predictably) that @code{99385} is the largest number
|
|
in the array.
|
|
|
|
@node Dynamic Typing
|
|
@subsection Functions and Their Effects on Variable Typing
|
|
|
|
@command{awk} is a very fluid language.
|
|
It is possible that @command{awk} can't tell if an identifier
|
|
represents a regular variable or an array until runtime.
|
|
Here is an annotated sample program:
|
|
|
|
@example
|
|
function foo(a)
@{
    a[1] = 1   # parameter is an array
@}

BEGIN @{
    b = 1
    foo(b)  # invalid: fatal type mismatch

    foo(x)  # x uninitialized, becomes an array dynamically
    x = 1   # now not allowed, runtime error
@}
|
|
@end example
|
|
|
|
Usually, such things aren't a big issue, but it's worth
|
|
being aware of them.
|
|
@c ENDOFRANGE udfunc
|
|
@c ENDOFRANGE funcud
|
|
|
|
@node Internationalization
|
|
@chapter Internationalization with @command{gawk}
|
|
|
|
Once upon a time, computer makers
|
|
wrote software that worked only in English.
|
|
Eventually, hardware and software vendors noticed that if their
|
|
systems worked in the native languages of non-English-speaking
|
|
countries, they were able to sell more systems.
|
|
As a result, internationalization and localization
|
|
of programs and software systems became a common practice.
|
|
|
|
@c STARTOFRANGE inloc
|
|
@cindex internationalization, localization
|
|
@cindex @command{gawk}, internationalization and, See internationalization
|
|
@cindex internationalization, localization, @command{gawk} and
|
|
Until recently, the ability to provide internationalization
|
|
was largely restricted to programs written in C and C++.
|
|
This @value{CHAPTER} describes the underlying library @command{gawk}
|
|
uses for internationalization, as well as how
|
|
@command{gawk} makes internationalization
|
|
features available at the @command{awk} program level.
|
|
Having internationalization available at the @command{awk} level
|
|
gives software developers additional flexibility---they are no
|
|
longer required to write in C when internationalization is
|
|
a requirement.
|
|
|
|
@menu
|
|
* I18N and L10N:: Internationalization and Localization.
|
|
* Explaining gettext:: How GNU @code{gettext} works.
|
|
* Programmer i18n:: Features for the programmer.
|
|
* Translator i18n:: Features for the translator.
|
|
* I18N Example:: A simple i18n example.
|
|
* Gawk I18N:: @command{gawk} is also internationalized.
|
|
@end menu
|
|
|
|
@node I18N and L10N
|
|
@section Internationalization and Localization
|
|
|
|
@cindex internationalization
|
|
@c comma is part of see
|
|
@cindex localization, See internationalization, localization
|
|
@cindex localization
|
|
@dfn{Internationalization} means writing (or modifying) a program once,
|
|
in such a way that it can use multiple languages without requiring
|
|
further source-code changes.
|
|
@dfn{Localization} means providing the data necessary for an
|
|
internationalized program to work in a particular language.
|
|
Most typically, these terms refer to features such as the language
|
|
used for printing error messages, the language used to read
|
|
responses, and information related to how numerical and
|
|
monetary values are printed and read.
|
|
|
|
@node Explaining gettext
|
|
@section GNU @code{gettext}
|
|
|
|
@cindex internationalizing a program
|
|
@c STARTOFRANGE gettex
|
|
@cindex @code{gettext} library
|
|
The facilities in GNU @code{gettext} focus on messages: strings printed
|
|
by a program, either directly or via formatting with @code{printf} or
|
|
@code{sprintf}.@footnote{For some operating systems, the @command{gawk}
|
|
port doesn't support GNU @code{gettext}. This applies most notably to
|
|
the PC operating systems. As such, these features are not available
|
|
if you are using one of those operating systems. Sorry.}
|
|
|
|
@cindex portability, @code{gettext} library and
|
|
When using GNU @code{gettext}, each application has its own
|
|
@dfn{text domain}. This is a unique name, such as @samp{kpilot} or @samp{gawk},
|
|
that identifies the application.
|
|
A complete application may have multiple components---programs written
|
|
in C or C++, as well as scripts written in @command{sh} or @command{awk}.
|
|
All of the components use the same text domain.
|
|
|
|
To make the discussion concrete, assume we're writing an application
|
|
named @command{guide}. Internationalization consists of the
|
|
following steps, in this order:
|
|
|
|
@enumerate
|
|
@item
|
|
The programmer goes
|
|
through the source for all of @command{guide}'s components
|
|
and marks each string that is a candidate for translation.
|
|
For example, @code{"`-F': option required"} is a good candidate for translation.
|
|
A table with strings of option names is not (e.g., @command{gawk}'s
|
|
@option{--profile} option should remain the same, no matter what the local
|
|
language).
|
|
|
|
@cindex @code{textdomain} function (C library)
|
|
@item
|
|
The programmer indicates the application's text domain
|
|
(@code{"guide"}) to the @code{gettext} library,
|
|
by calling the @code{textdomain} function.
|
|
|
|
@item
|
|
Messages from the application are extracted from the source code and
|
|
collected into a portable object file (@file{guide.po}),
|
|
which lists the strings and their translations.
|
|
The translations are initially empty.
|
|
The original (usually English) messages serve as the key for
|
|
lookup of the translations.
|
|
|
|
@cindex @code{.po} files
|
|
@cindex files, @code{.po}
|
|
@cindex portable object files
|
|
@cindex files, portable object
|
|
@item
|
|
For each language with a translator, @file{guide.po}
|
|
is copied and translations are created and shipped with the application.
|
|
|
|
@cindex @code{.mo} files
|
|
@cindex files, @code{.mo}
|
|
@cindex message object files
|
|
@cindex files, message object
|
|
@item
|
|
Each language's @file{.po} file is converted into a binary
|
|
message object (@file{.mo}) file.
|
|
A message object file contains the original messages and their
|
|
translations in a binary format that allows fast lookup of translations
|
|
at runtime.
|
|
|
|
@item
|
|
When @command{guide} is built and installed, the binary translation files
|
|
are installed in a standard place.
|
|
|
|
@cindex @code{bindtextdomain} function (C library)
|
|
@item
|
|
For testing and development, it is possible to tell @code{gettext}
|
|
to use @file{.mo} files in a different directory than the standard
|
|
one by using the @code{bindtextdomain} function.
|
|
|
|
@cindex @code{.mo} files, specifying directory of
|
|
@cindex files, @code{.mo}, specifying directory of
|
|
@cindex message object files, specifying directory of
|
|
@cindex files, message object, specifying directory of
|
|
@item
|
|
At runtime, @command{guide} looks up each string via a call
|
|
to @code{gettext}. The returned string is the translated string
|
|
if available, or the original string if not.
|
|
|
|
@item
|
|
If necessary, it is possible to access messages from a different
|
|
text domain than the one belonging to the application, without
|
|
having to switch the application's default text domain back
|
|
and forth.
|
|
@end enumerate
|
|
|
|
@cindex @code{gettext} function (C library)
|
|
In C (or C++), the string marking and dynamic translation lookup
|
|
are accomplished by wrapping each string in a call to @code{gettext}:
|
|
|
|
@example
|
|
printf(gettext("Don't Panic!\n"));
|
|
@end example
|
|
|
|
The tools that extract messages from source code pull out all
|
|
strings enclosed in calls to @code{gettext}.
|
|
|
|
@cindex @code{_} (underscore), @code{_} C macro
|
|
@cindex underscore (@code{_}), @code{_} C macro
|
|
The GNU @code{gettext} developers, recognizing that typing
|
|
@samp{gettext} over and over again is both painful and ugly to look
|
|
at, use the macro @samp{_} (an underscore) to make things easier:
|
|
|
|
@example
|
|
/* In the standard header file: */
|
|
#define _(str) gettext(str)
|
|
|
|
/* In the program text: */
|
|
printf(_("Don't Panic!\n"));
|
|
@end example
|
|
|
|
@cindex internationalization, localization, locale categories
|
|
@cindex @code{gettext} library, locale categories
|
|
@cindex locale categories
|
|
@noindent
|
|
This reduces the typing overhead to just three extra characters per string
|
|
and is considerably easier to read as well.
|
|
There are locale @dfn{categories}
|
|
for different types of locale-related information.
|
|
The defined locale categories that @code{gettext} knows about are:
|
|
|
|
@table @code
|
|
@cindex @code{LC_MESSAGES} locale category
|
|
@item LC_MESSAGES
|
|
Text messages. This is the default category for @code{gettext}
|
|
operations, but it is possible to supply a different one explicitly,
|
|
if necessary. (It is almost never necessary to supply a different category.)
|
|
|
|
@cindex sorting characters in different languages
|
|
@cindex @code{LC_COLLATE} locale category
|
|
@item LC_COLLATE
|
|
Text-collation information; i.e., how different characters
|
|
and/or groups of characters sort in a given language.
|
|
|
|
@cindex @code{LC_CTYPE} locale category
|
|
@item LC_CTYPE
|
|
Character-type information (alphabetic, digit, upper- or lowercase, and
|
|
so on).
|
|
This information is accessed via the
|
|
POSIX character classes in regular expressions,
|
|
such as @code{/[[:alnum:]]/}
|
|
(@pxref{Regexp Operators}).
|
|
|
|
@cindex monetary information, localization
|
|
@cindex currency symbols, localization
|
|
@cindex @code{LC_MONETARY} locale category
|
|
@item LC_MONETARY
|
|
Monetary information, such as the currency symbol, and whether the
|
|
symbol goes before or after a number.
|
|
|
|
@cindex @code{LC_NUMERIC} locale category
|
|
@item LC_NUMERIC
|
|
Numeric information, such as which characters to use for the decimal
|
|
point and the thousands separator.@footnote{Americans
|
|
use a comma every three decimal places and a period for the decimal
|
|
point, while many Europeans do exactly the opposite:
|
|
@code{1,234.56} versus @code{1.234,56}.}
|
|
|
|
@cindex @code{LC_RESPONSE} locale category
|
|
@item LC_RESPONSE
|
|
Response information, such as how ``yes'' and ``no'' appear in the
|
|
local language, and possibly other information as well.
|
|
|
|
@cindex time, localization and
|
|
@c last comma does NOT start a tertiary
|
|
@cindex dates, information related to, localization
|
|
@cindex @code{LC_TIME} locale category
|
|
@item LC_TIME
|
|
Time- and date-related information, such as 12- or 24-hour clock, month printed
|
|
before or after day in a date, local month abbreviations, and so on.
|
|
|
|
@cindex @code{LC_ALL} locale category
|
|
@item LC_ALL
|
|
All of the above. (Not too useful in the context of @code{gettext}.)
|
|
@end table
|
|
@c ENDOFRANGE gettex
|
|
|
|
@node Programmer i18n
|
|
@section Internationalizing @command{awk} Programs
|
|
@c STARTOFRANGE inap
|
|
@cindex @command{awk} programs, internationalizing
|
|
|
|
@command{gawk} provides the following variables and functions for
|
|
internationalization:
|
|
|
|
@table @code
|
|
@cindex @code{TEXTDOMAIN} variable
|
|
@item TEXTDOMAIN
|
|
This variable indicates the application's text domain.
|
|
For compatibility with GNU @code{gettext}, the default
|
|
value is @code{"messages"}.
|
|
|
|
@cindex internationalization, localization, marked strings
|
|
@cindex strings, for localization
|
|
@item _"your message here"
|
|
String constants marked with a leading underscore
|
|
are candidates for translation at runtime.
|
|
String constants without a leading underscore are not translated.
|
|
|
|
@cindex @code{dcgettext} function (@command{gawk})
|
|
@item dcgettext(@var{string} @r{[}, @var{domain} @r{[}, @var{category}@r{]]})
|
|
This built-in function returns the translation of @var{string} in
|
|
text domain @var{domain} for locale category @var{category}.
|
|
The default value for @var{domain} is the current value of @code{TEXTDOMAIN}.
|
|
The default value for @var{category} is @code{"LC_MESSAGES"}.
|
|
|
|
If you supply a value for @var{category}, it must be a string equal to
|
|
one of the known locale categories described in
|
|
@ifnotinfo
|
|
the previous @value{SECTION}.
|
|
@end ifnotinfo
|
|
@ifinfo
|
|
@ref{Explaining gettext}.
|
|
@end ifinfo
|
|
You must also supply a text domain. Use @code{TEXTDOMAIN} if
|
|
you want to use the current domain.
|
|
|
|
@strong{Caution:} The order of arguments to the @command{awk} version
|
|
of the @code{dcgettext} function is purposely different from the order for
|
|
the C version. The @command{awk} version's order was
|
|
chosen to be simple and to allow for reasonable @command{awk}-style
|
|
default arguments.
|
|
|
|
@cindex @code{dcngettext} function (@command{gawk})
|
|
@item dcngettext(@var{string1}, @var{string2}, @var{number} @r{[}, @var{domain} @r{[}, @var{category}@r{]]})
|
|
This built-in function returns the plural form used for @var{number} of the
|
|
translation of @var{string1} and @var{string2} in text domain
|
|
@var{domain} for locale category @var{category}. @var{string1} is the
|
|
English singular variant of a message, and @var{string2} the English plural
|
|
variant of the same message.
|
|
The default value for @var{domain} is the current value of @code{TEXTDOMAIN}.
|
|
The default value for @var{category} is @code{"LC_MESSAGES"}.
|
|
|
|
The same remarks as for the @code{dcgettext} function apply.
|
|
|
|
@cindex @code{.mo} files, specifying directory of
|
|
@cindex files, @code{.mo}, specifying directory of
|
|
@cindex message object files, specifying directory of
|
|
@cindex files, message object, specifying directory of
|
|
@cindex @code{bindtextdomain} function (@command{gawk})
|
|
@item bindtextdomain(@var{directory} @r{[}, @var{domain}@r{]})
|
|
This built-in function allows you to specify the directory in which
|
|
@code{gettext} looks for @file{.mo} files, in case they
|
|
will not or cannot be placed in the standard locations
|
|
(e.g., during testing).
|
|
It returns the directory in which @var{domain} is ``bound.''
|
|
|
|
The default @var{domain} is the value of @code{TEXTDOMAIN}.
|
|
If @var{directory} is the null string (@code{""}), then
|
|
@code{bindtextdomain} returns the current binding for the
|
|
given @var{domain}.
|
|
@end table
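
The following sketch shows the three functions together. The text domain
@code{"guide"} and the messages are assumptions made for illustration;
when no translation is installed, the original strings should come back
unchanged:

@example
BEGIN @{
    TEXTDOMAIN = "guide"

    # explicit domain and category; same as dcgettext("Don't Panic")
    print dcgettext("Don't Panic", TEXTDOMAIN, "LC_MESSAGES")

    # pick the singular or plural variant according to n
    n = 3
    printf(dcngettext("%d user logged in\n",
                      "%d users logged in\n", n), n)

    # a null directory asks for the current binding
    print bindtextdomain("")
@}
@end example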
|
|
|
|
To use these facilities in your @command{awk} program, follow the steps
|
|
outlined in
|
|
@ifnotinfo
|
|
the previous @value{SECTION},
|
|
@end ifnotinfo
|
|
@ifinfo
|
|
@ref{Explaining gettext},
|
|
@end ifinfo
|
|
like so:
|
|
|
|
@enumerate
|
|
@cindex @code{BEGIN} pattern, @code{TEXTDOMAIN} variable and
|
|
@cindex @code{TEXTDOMAIN} variable, @code{BEGIN} pattern and
|
|
@item
|
|
Set the variable @code{TEXTDOMAIN} to the text domain of
|
|
your program. This is best done in a @code{BEGIN} rule
|
|
(@pxref{BEGIN/END}),
|
|
or it can also be done via the @option{-v} command-line
|
|
option (@pxref{Options}):
|
|
|
|
@example
|
|
BEGIN @{
    TEXTDOMAIN = "guide"
    @dots{}
@}
|
|
@end example
|
|
|
|
@cindex @code{_} (underscore), translatable string
|
|
@cindex underscore (@code{_}), translatable string
|
|
@item
|
|
Mark all translatable strings with a leading underscore (@samp{_})
|
|
character. It @emph{must} be adjacent to the opening
|
|
quote of the string. For example:
|
|
|
|
@example
|
|
print _"hello, world"
|
|
x = _"you goofed"
|
|
printf(_"Number of users is %d\n", nusers)
|
|
@end example
|
|
|
|
@item
|
|
If you are creating strings dynamically, you can
|
|
still translate them, using the @code{dcgettext}
|
|
built-in function:
|
|
|
|
@example
|
|
message = nusers " users logged in"
|
|
message = dcgettext(message, "adminprog")
|
|
print message
|
|
@end example
|
|
|
|
Here, the call to @code{dcgettext} supplies a different
|
|
text domain (@code{"adminprog"}) in which to find the
|
|
message, but it uses the default @code{"LC_MESSAGES"} category.
|
|
|
|
@cindex @code{LC_MESSAGES} locale category, @code{bindtextdomain} function (@command{gawk})
|
|
@item
|
|
During development, you might want to put the @file{.mo}
|
|
file in a private directory for testing. This is done
|
|
with the @code{bindtextdomain} built-in function:
|
|
|
|
@example
|
|
BEGIN @{
    TEXTDOMAIN = "guide"   # our text domain
    if (Testing) @{
        # where to find our files
        bindtextdomain("testdir")
        # joe is in charge of adminprog
        bindtextdomain("../joe/testdir", "adminprog")
    @}
    @dots{}
@}
|
|
@end example
|
|
|
|
@end enumerate
|
|
|
|
@xref{I18N Example},
|
|
for an example program showing the steps to create
|
|
and use translations from @command{awk}.
|
|
|
|
@node Translator i18n
|
|
@section Translating @command{awk} Programs
|
|
|
|
@cindex @code{.po} files
|
|
@cindex files, @code{.po}
|
|
@cindex portable object files
|
|
@cindex files, portable object
|
|
Once a program's translatable strings have been marked, they must
|
|
be extracted to create the initial @file{.po} file.
|
|
As part of translation, it is often helpful to rearrange the order
|
|
in which arguments to @code{printf} are output.
|
|
|
|
@command{gawk}'s @option{--gen-po} command-line option extracts
|
|
the messages and is discussed next.
|
|
After that, @code{printf}'s ability to
|
|
rearrange the order for @code{printf} arguments at runtime
|
|
is covered.
|
|
|
|
@menu
|
|
* String Extraction:: Extracting marked strings.
|
|
* Printf Ordering:: Rearranging @code{printf} arguments.
|
|
* I18N Portability:: @command{awk}-level portability issues.
|
|
@end menu
|
|
|
|
@node String Extraction
|
|
@subsection Extracting Marked Strings
|
|
@cindex strings, extracting
|
|
@c comma does NOT start secondary
|
|
@cindex marked strings, extracting
|
|
@cindex @code{--gen-po} option
|
|
@cindex command-line options, string extraction
|
|
@cindex string extraction (internationalization)
|
|
@cindex marked string extraction (internationalization)
|
|
@cindex extraction, of marked strings (internationalization)
|
|
|
|
@cindex @code{--gen-po} option
|
|
Once your @command{awk} program is working, and all the strings have
|
|
been marked and you've set (and perhaps bound) the text domain,
|
|
it is time to produce translations.
|
|
First, use the @option{--gen-po} command-line option to create
|
|
the initial @file{.po} file:
|
|
|
|
@example
|
|
$ gawk --gen-po -f guide.awk > guide.po
|
|
@end example
|
|
|
|
@cindex @code{xgettext} utility
|
|
When run with @option{--gen-po}, @command{gawk} does not execute your
|
|
program. Instead, it parses it as usual and prints all marked strings
|
|
to standard output in the format of a GNU @code{gettext} Portable Object
|
|
file. Also included in the output are any constant strings that
|
|
appear as the first argument to @code{dcgettext} or as the first and
|
|
second argument to @code{dcngettext}.@footnote{Starting with @code{gettext}
|
|
version 0.11.5, the @command{xgettext} utility that comes with GNU
|
|
@code{gettext} can handle @file{.awk} files.}
|
|
@xref{I18N Example},
|
|
for the full list of steps to go through to create and test
|
|
translations for @command{guide}.
|
|
|
|
@node Printf Ordering
|
|
@subsection Rearranging @code{printf} Arguments
|
|
|
|
@cindex @code{printf} statement, positional specifiers
|
|
@c comma does NOT start secondary
|
|
@cindex positional specifiers, @code{printf} statement
|
|
Format strings for @code{printf} and @code{sprintf}
|
|
(@pxref{Printf})
|
|
present a special problem for translation.
|
|
Consider the following:@footnote{This example is borrowed
|
|
from the GNU @code{gettext} manual.}
|
|
|
|
@c line broken here only for smallbook format
|
|
@example
|
|
printf(_"String `%s' has %d characters\n",
          string, length(string))
|
|
@end example
|
|
|
|
A possible German translation for this might be:
|
|
|
|
@example
|
|
"%d Zeichen lang ist die Zeichenkette `%s'\n"
|
|
@end example
|
|
|
|
The problem should be obvious: the order of the format
|
|
specifications is different from the original!
|
|
Even though @code{gettext} can return the translated string
|
|
at runtime,
|
|
it cannot change the argument order in the call to @code{printf}.
|
|
|
|
To solve this problem, @code{printf} format specifiers may have
|
|
an additional optional element, which we call a @dfn{positional specifier}.
|
|
For example:
|
|
|
|
@example
|
|
"%2$d Zeichen lang ist die Zeichenkette `%1$s'\n"
|
|
@end example
|
|
|
|
Here, the positional specifier consists of an integer count, which indicates which
|
|
argument to use, and a @samp{$}. Counts are one-based, and the
|
|
format string itself is @emph{not} included. Thus, in the following
|
|
example, @samp{string} is the first argument and @samp{length(string)} is the second:
|
|
|
|
@example
|
|
$ gawk 'BEGIN @{
>     string = "Dont Panic"
>     printf _"%2$d characters live in \"%1$s\"\n",
>                         string, length(string)
> @}'
|
|
@print{} 10 characters live in "Dont Panic"
|
|
@end example
|
|
|
|
If present, positional specifiers come first in the format specification,
|
|
before the flags, the field width, and/or the precision.
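
For instance, the following sketch combines positional specifiers with
the @samp{-} flag and a field width; the values are arbitrary:

@example
$ gawk 'BEGIN @{ printf "%2$-10s|%1$5d|\n", 42, "answer" @}'
@print{} answer    |   42|
@end example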
|
|
|
|
Positional specifiers can be used with the dynamic field width and
|
|
precision capability:
|
|
|
|
@example
|
|
$ gawk 'BEGIN @{
>     printf("%*.*s\n", 10, 20, "hello")
>     printf("%3$*2$.*1$s\n", 20, 10, "hello")
> @}'
@print{}      hello
@print{}      hello
|
|
@end example
|
|
|
|
@noindent
|
|
@strong{Note:} When using @samp{*} with a positional specifier, the @samp{*}
|
|
comes first, then the integer position, and then the @samp{$}.
|
|
This is somewhat counterintuitive.
|
|
|
|
@cindex @code{printf} statement, positional specifiers, mixing with regular formats
|
|
@c first comma does is part of primary
|
|
@cindex positional specifiers, @code{printf} statement, mixing with regular formats
|
|
@cindex format specifiers, mixing regular with positional specifiers
|
|
@command{gawk} does not allow you to mix regular format specifiers
|
|
and those with positional specifiers in the same string:
|
|
|
|
@smallexample
|
|
$ gawk 'BEGIN @{ printf _"%d %3$s\n", 1, 2, "hi" @}'
|
|
@error{} gawk: cmd. line:1: fatal: must use `count$' on all formats or none
|
|
@end smallexample
|
|
|
|
@strong{Note:} There are some pathological cases that @command{gawk} may fail to
|
|
diagnose. In such cases, the output may not be what you expect.
|
|
It's still a bad idea to try mixing them, even if @command{gawk}
|
|
doesn't detect it.
|
|
|
|
Although positional specifiers can be used directly in @command{awk} programs,
|
|
their primary purpose is to help in producing correct translations of
|
|
format strings into languages different from the one in which the program
|
|
is first written.
|
|
|
|
@node I18N Portability
|
|
@subsection @command{awk} Portability Issues
|
|
|
|
@cindex portability, internationalization and
|
|
@cindex internationalization, localization, portability and
|
|
@command{gawk}'s internationalization features were purposely chosen to
|
|
have as little impact as possible on the portability of @command{awk}
|
|
programs that use them to other versions of @command{awk}.
|
|
Consider this program:
|
|
|
|
@example
|
|
BEGIN @{
    TEXTDOMAIN = "guide"
    if (Test_Guide)   # set with -v
        bindtextdomain("/test/guide/messages")
    print _"don't panic!"
@}
|
|
@end example
|
|
|
|
@noindent
|
|
As written, it won't work on other versions of @command{awk}.
|
|
However, it is actually almost portable, requiring very little
|
|
change:
|
|
|
|
@itemize @bullet
|
|
@cindex @code{TEXTDOMAIN} variable, portability and
|
|
@item
|
|
Assignments to @code{TEXTDOMAIN} won't have any effect,
|
|
since @code{TEXTDOMAIN} is not special in other @command{awk} implementations.
|
|
|
|
@item
|
|
Non-GNU versions of @command{awk} treat marked strings
|
|
as the concatenation of a variable named @code{_} with the string
|
|
following it.@footnote{This is good fodder for an ``Obfuscated
|
|
@command{awk}'' contest.} Typically, the variable @code{_} has
|
|
the null string (@code{""}) as its value, leaving the original string constant as
|
|
the result.
|
|
|
|
@item
|
|
By defining ``dummy'' functions to replace @code{dcgettext}, @code{dcngettext}
|
|
and @code{bindtextdomain}, the @command{awk} program can be made to run, but
|
|
all the messages are output in the original language.
|
|
For example:
|
|
|
|
@cindex @code{bindtextdomain} function (@command{gawk}), portability and
|
|
@cindex @code{dcgettext} function (@command{gawk}), portability and
|
|
@cindex @code{dcngettext} function (@command{gawk}), portability and
|
|
@example
|
|
@c file eg/lib/libintl.awk
|
|
function bindtextdomain(dir, domain)
@{
    return dir
@}

function dcgettext(string, domain, category)
@{
    return string
@}

function dcngettext(string1, string2, number, domain, category)
@{
    return (number == 1 ? string1 : string2)
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
@item
|
|
The use of positional specifications in @code{printf} or
|
|
@code{sprintf} is @emph{not} portable.
|
|
To support @code{gettext} at the C level, many systems' C versions of
|
|
@code{sprintf} do support positional specifiers. But it works only if
|
|
enough arguments are supplied in the function call. Many versions of
|
|
@command{awk} pass @code{printf} formats and arguments unchanged to the
|
|
underlying C library version of @code{sprintf}, but only one format and
|
|
argument at a time. What happens if a positional specification is
|
|
used is anybody's guess.
|
|
However, since the positional specifications are primarily for use in
|
|
@emph{translated} format strings, and since non-GNU @command{awk}s never
|
|
retrieve the translated string, this should not be a problem in practice.
|
|
@end itemize
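
As a small illustration of the point about marked strings, the following
sketch assigns a visible value to the variable @code{_}.  The assignment
is an assumption made purely for demonstration; real programs should
leave @code{_} unset.  When run with @samp{gawk --posix}, or with an
@command{awk} that does not support marked strings, the ``marked'' string
turns out to be an ordinary concatenation:

@example
$ gawk --posix 'BEGIN @{ _ = "X"; print _"hello" @}'
@print{} Xhello
@end example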
|
|
@c ENDOFRANGE inap
|
|
|
|
@node I18N Example
|
|
@section A Simple Internationalization Example
|
|
|
|
Now let's look at a step-by-step example of how to internationalize and
|
|
localize a simple @command{awk} program, using @file{guide.awk} as our
|
|
original source:
|
|
|
|
@example
|
|
@c file eg/prog/guide.awk
|
|
BEGIN @{
|
|
TEXTDOMAIN = "guide"
|
|
bindtextdomain(".") # for testing
|
|
print _"Don't Panic"
|
|
print _"The Answer Is", 42
|
|
print "Pardon me, Zaphod who?"
|
|
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
@noindent
|
|
Run @samp{gawk --gen-po} to create the @file{.po} file:
|
|
|
|
@example
|
|
$ gawk --gen-po -f guide.awk > guide.po
|
|
@end example
|
|
|
|
@noindent
|
|
This produces:
|
|
|
|
@example
|
|
@c file eg/data/guide.po
|
|
#: guide.awk:4
|
|
msgid "Don't Panic"
|
|
msgstr ""
|
|
|
|
#: guide.awk:5
|
|
msgid "The Answer Is"
|
|
msgstr ""
|
|
|
|
@c endfile
|
|
@end example
|
|
|
|
This original portable object file is saved and reused for each language
|
|
into which the application is translated. The @code{msgid}
|
|
is the original string and the @code{msgstr} is the translation.
|
|
|
|
@strong{Note:} Strings not marked with a leading underscore do not
|
|
appear in the @file{guide.po} file.
|
|
|
|
Next, the messages must be translated.
|
|
Here is a translation to a hypothetical dialect of English,
|
|
called ``Mellow'':@footnote{Perhaps it would be better if it were
|
|
called ``Hippy.'' Ah, well.}
|
|
|
|
@example
|
|
@group
|
|
$ cp guide.po guide-mellow.po
|
|
@var{Add translations to} guide-mellow.po @dots{}
|
|
@end group
|
|
@end example
|
|
|
|
@noindent
|
|
Following are the translations:
|
|
|
|
@example
|
|
@c file eg/data/guide-mellow.po
|
|
#: guide.awk:4
|
|
msgid "Don't Panic"
|
|
msgstr "Hey man, relax!"
|
|
|
|
#: guide.awk:5
|
|
msgid "The Answer Is"
|
|
msgstr "Like, the scoop is"
|
|
|
|
@c endfile
|
|
@end example
|
|
|
|
@cindex Linux
|
|
@cindex GNU/Linux
|
|
The next step is to make the directory to hold the binary message object
|
|
file and then to create the @file{guide.mo} file.
|
|
The directory layout shown here is standard for GNU @code{gettext} on
|
|
GNU/Linux systems. Other versions of @code{gettext} may use a different
|
|
layout:
|
|
|
|
@example
|
|
$ mkdir en_US en_US/LC_MESSAGES
|
|
@end example
|
|
|
|
@cindex @code{.po} files, converting to @code{.mo}
|
|
@cindex files, @code{.po}, converting to @code{.mo}
|
|
@cindex @code{.mo} files, converting from @code{.po}
|
|
@cindex files, @code{.mo}, converting from @code{.po}
|
|
@cindex portable object files, converting to message object files
|
|
@cindex files, portable object, converting to message object files
|
|
@cindex message object files, converting from portable object files
|
|
@cindex files, message object, converting from portable object files
|
|
@cindex @command{msgfmt} utility
|
|
The @command{msgfmt} utility does the conversion from human-readable
|
|
@file{.po} file to machine-readable @file{.mo} file.
|
|
By default, @command{msgfmt} creates a file named @file{messages}.
|
|
This file must be renamed and placed in the proper directory so that
|
|
@command{gawk} can find it:
|
|
|
|
@example
|
|
$ msgfmt guide-mellow.po
|
|
$ mv messages en_US/LC_MESSAGES/guide.mo
|
|
@end example
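
With GNU @command{msgfmt}, the conversion and the rename can be combined
by naming the output file with the @option{-o} option.  This is only a
sketch for the directory layout used above; other versions of
@command{msgfmt} may not accept this option:

@example
$ msgfmt -o en_US/LC_MESSAGES/guide.mo guide-mellow.po
@end example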
|
|
|
|
Finally, we run the program to test it:
|
|
|
|
@example
|
|
$ gawk -f guide.awk
|
|
@print{} Hey man, relax!
|
|
@print{} Like, the scoop is 42
|
|
@print{} Pardon me, Zaphod who?
|
|
@end example
|
|
|
|
If the three replacement functions for @code{dcgettext}, @code{dcngettext}
|
|
and @code{bindtextdomain}
|
|
(@pxref{I18N Portability})
|
|
are in a file named @file{libintl.awk},
|
|
then we can run @file{guide.awk} unchanged as follows:
|
|
|
|
@example
|
|
$ gawk --posix -f guide.awk -f libintl.awk
|
|
@print{} Don't Panic
|
|
@print{} The Answer Is 42
|
|
@print{} Pardon me, Zaphod who?
|
|
@end example
|
|
|
|
@node Gawk I18N
|
|
@section @command{gawk} Can Speak Your Language
|
|
|
|
As of @value{PVERSION} 3.1, @command{gawk} itself has been internationalized
|
|
using the GNU @code{gettext} package.
|
|
@ifinfo
|
|
(GNU @code{gettext} is described in
|
|
complete detail in
|
|
@ref{Top}.)
|
|
@end ifinfo
|
|
@ifnotinfo
|
|
(GNU @code{gettext} is described in
|
|
complete detail in
|
|
@cite{GNU gettext tools}.)
|
|
@end ifnotinfo
|
|
As of this writing, the latest version of GNU @code{gettext} is
|
|
@uref{ftp://ftp.gnu.org/gnu/gettext/gettext-0.11.5.tar.gz, @value{PVERSION} 0.11.5}.
|
|
|
|
If a translation of @command{gawk}'s messages exists,
|
|
then @command{gawk} produces usage messages, warnings,
|
|
and fatal errors in the local language.
|
|
|
|
@cindex @code{--with-included-gettext} configuration option
|
|
@cindex configuration option, @code{--with-included-gettext}
|
|
On systems that do not use @value{PVERSION} 2 (or later) of the GNU C library, you should
|
|
configure @command{gawk} with the @option{--with-included-gettext} option
|
|
before compiling and installing it.
|
|
@xref{Additional Configuration Options},
|
|
for more information.
|
|
@c ENDOFRANGE inloc
|
|
|
|
@node Advanced Features
|
|
@chapter Advanced Features of @command{gawk}
|
|
@cindex advanced features, network connections, See Also networks, connections
|
|
@c STARTOFRANGE gawadv
|
|
@cindex @command{gawk}, features, advanced
|
|
@c STARTOFRANGE advgaw
|
|
@cindex advanced features, @command{gawk}
|
|
@ignore
|
|
Contributed by: Peter Langston <pud!psl@bellcore.bellcore.com>
|
|
|
|
Found in Steve English's "signature" line:
|
|
|
|
"Write documentation as if whoever reads it is a violent psychopath
|
|
who knows where you live."
|
|
@end ignore
|
|
@quotation
|
|
@i{Write documentation as if whoever reads it is
|
|
a violent psychopath who knows where you live.}@*
|
|
Steve English, as quoted by Peter Langston
|
|
@end quotation
|
|
|
|
This @value{CHAPTER} discusses advanced features in @command{gawk}.
|
|
It's a bit of a ``grab bag'' of items that are otherwise unrelated
|
|
to each other.
|
|
First, a command-line option allows @command{gawk} to recognize
|
|
nondecimal numbers in input data, not just in @command{awk}
|
|
programs. Next, two-way I/O, discussed briefly in earlier parts of this
|
|
@value{DOCUMENT}, is described in full detail, along with the basics
|
|
of TCP/IP networking and BSD portal files. Finally, @command{gawk}
|
|
can @dfn{profile} an @command{awk} program, making it possible to tune
|
|
it for performance.
|
|
|
|
@ref{Dynamic Extensions},
|
|
discusses the ability to dynamically add new built-in functions to
|
|
@command{gawk}. As this feature is still immature and likely to change,
|
|
its description is relegated to an appendix.
|
|
|
|
@menu
|
|
* Nondecimal Data:: Allowing nondecimal input data.
|
|
* Two-way I/O:: Two-way communications with another process.
|
|
* TCP/IP Networking:: Using @command{gawk} for network programming.
|
|
* Portal Files:: Using @command{gawk} with BSD portals.
|
|
* Profiling:: Profiling your @command{awk} programs.
|
|
@end menu
|
|
|
|
@node Nondecimal Data
|
|
@section Allowing Nondecimal Input Data
|
|
@cindex @code{--non-decimal-data} option
|
|
@cindex advanced features, @command{gawk}, nondecimal input data
|
|
@c last comma does NOT start tertiary
|
|
@cindex input, data, nondecimal
|
|
@cindex constants, nondecimal
|
|
|
|
If you run @command{gawk} with the @option{--non-decimal-data} option,
|
|
you can have nondecimal constants in your input data:
|
|
|
|
@c line break here for small book format
|
|
@example
|
|
$ echo 0123 123 0x123 |
|
|
> gawk --non-decimal-data '@{ printf "%d, %d, %d\n",
|
|
> $1, $2, $3 @}'
|
|
@print{} 83, 123, 291
|
|
@end example
|
|
|
|
For this feature to work, write your program so that
|
|
@command{gawk} treats your data as numeric:
|
|
|
|
@example
|
|
$ echo 0123 123 0x123 | gawk '@{ print $1, $2, $3 @}'
|
|
@print{} 0123 123 0x123
|
|
@end example
|
|
|
|
@noindent
|
|
The @code{print} statement treats its expressions as strings.
|
|
Although the fields can act as numbers when necessary,
|
|
they are still strings, so @code{print} does not try to treat them
|
|
numerically. You may need to add zero to a field to force it to
|
|
be treated as a number. For example:
|
|
|
|
@example
|
|
$ echo 0123 123 0x123 | gawk --non-decimal-data '
|
|
> @{ print $1, $2, $3
|
|
> print $1 + 0, $2 + 0, $3 + 0 @}'
|
|
@print{} 0123 123 0x123
|
|
@print{} 83 123 291
|
|
@end example
|
|
|
|
Because it is common to have decimal data with leading zeros, and because
|
|
using this option could lead to surprising results, the default is to leave this
|
|
facility disabled. If you want it, you must explicitly request it.
|
|
|
|
@cindex programming conventions, @code{--non-decimal-data} option
|
|
@cindex @code{--non-decimal-data} option, @code{strtonum} function and
|
|
@cindex @code{strtonum} function (@command{gawk}), @code{--non-decimal-data} option and
|
|
@strong{Caution:}
|
|
@emph{Use of this option is not recommended.}
|
|
It can break old programs very badly.
|
|
Instead, use the @code{strtonum} function to convert your data
|
|
(@pxref{Nondecimal-numbers}).
|
|
This makes your programs easier to write and easier to read, and
|
|
leads to less surprising results.
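
As a sketch of the recommended alternative, the same input used earlier
can be converted explicitly with @code{strtonum}, without supplying the
@option{--non-decimal-data} option.  Only the values passed to
@code{strtonum} are interpreted as octal or hexadecimal:

@example
$ echo 0123 123 0x123 |
> gawk '@{ print strtonum($1), strtonum($2), strtonum($3) @}'
@print{} 83 123 291
@end example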
|
|
|
|
@node Two-way I/O
|
|
@section Two-Way Communications with Another Process
|
|
@cindex Brennan, Michael
|
|
@cindex programmers, attractiveness of
|
|
@smallexample
|
|
@c Path: cssun.mathcs.emory.edu!gatech!newsxfer3.itd.umich.edu!news-peer.sprintlink.net!news-sea-19.sprintlink.net!news-in-west.sprintlink.net!news.sprintlink.net!Sprint!204.94.52.5!news.whidbey.com!brennan
|
|
From: brennan@@whidbey.com (Mike Brennan)
|
|
Newsgroups: comp.lang.awk
|
|
Subject: Re: Learn the SECRET to Attract Women Easily
|
|
Date: 4 Aug 1997 17:34:46 GMT
|
|
@c Organization: WhidbeyNet
|
|
@c Lines: 12
|
|
Message-ID: <5s53rm$eca@@news.whidbey.com>
|
|
@c References: <5s20dn$2e1@chronicle.concentric.net>
|
|
@c Reply-To: brennan@whidbey.com
|
|
@c NNTP-Posting-Host: asn202.whidbey.com
|
|
@c X-Newsreader: slrn (0.9.4.1 UNIX)
|
|
@c Xref: cssun.mathcs.emory.edu comp.lang.awk:5403
|
|
|
|
On 3 Aug 1997 13:17:43 GMT, Want More Dates???
|
|
<tracy78@@kilgrona.com> wrote:
|
|
>Learn the SECRET to Attract Women Easily
|
|
>
|
|
>The SCENT(tm) Pheromone Sex Attractant For Men to Attract Women
|
|
|
|
The scent of awk programmers is a lot more attractive to women than
|
|
the scent of perl programmers.
|
|
--
|
|
Mike Brennan
|
|
@c brennan@@whidbey.com
|
|
@end smallexample
|
|
|
|
@c final comma is part of tertiary
|
|
@cindex advanced features, @command{gawk}, processes, communicating with
|
|
@cindex processes, two-way communications with
|
|
It is often useful to be able to
|
|
send data to a separate program for
|
|
processing and then read the result. This can always be
|
|
done with temporary files:
|
|
|
|
@example
|
|
# write the data for processing
|
|
tempfile = ("mydata." PROCINFO["pid"])
|
|
while (@var{not done with data})
|
|
print @var{data} | ("subprogram > " tempfile)
|
|
close("subprogram > " tempfile)
|
|
|
|
# read the results, remove tempfile when done
|
|
while ((getline newdata < tempfile) > 0)
|
|
@var{process} newdata @var{appropriately}
|
|
close(tempfile)
|
|
system("rm " tempfile)
|
|
@end example
|
|
|
|
@noindent
|
|
This works, but not elegantly. Among other things, it requires that
|
|
the program be run in a directory that cannot be shared among users;
|
|
for example, @file{/tmp} will not do, as another user might happen
|
|
to be using a temporary file with the same name.
|
|
|
|
@cindex coprocesses
|
|
@cindex input/output, two-way
|
|
@cindex @code{|} (vertical bar), @code{|&} operator (I/O)
|
|
@cindex vertical bar (@code{|}), @code{|&} operator (I/O)
|
|
@cindex @command{csh} utility, @code{|&} operator, comparison with
|
|
Starting with @value{PVERSION} 3.1 of @command{gawk}, it is possible to
|
|
open a @emph{two-way} pipe to another process. The second process is
|
|
termed a @dfn{coprocess}, since it runs in parallel with @command{gawk}.
|
|
The two-way connection is created using the new @samp{|&} operator
|
|
(borrowed from the Korn shell, @command{ksh}):@footnote{This is very
|
|
different from the same operator in the C shell, @command{csh}.}
|
|
|
|
@example
|
|
do @{
|
|
print @var{data} |& "subprogram"
|
|
"subprogram" |& getline results
|
|
@} while (@var{data left to process})
|
|
close("subprogram")
|
|
@end example
|
|
|
|
The first time an I/O operation is executed using the @samp{|&}
|
|
operator, @command{gawk} creates a two-way pipeline to a child process
|
|
that runs the other program. Output created with @code{print}
|
|
or @code{printf} is written to the program's standard input, and
|
|
output from the program's standard output can be read by the @command{gawk}
|
|
program using @code{getline}.
|
|
As is the case with processes started by @samp{|}, the subprogram
|
|
can be any program, or pipeline of programs, that can be started by
|
|
the shell.
|
|
|
|
There are some cautionary items to be aware of:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
As the code inside @command{gawk} currently stands, the coprocess's
|
|
standard error goes to the same place that the parent @command{gawk}'s
|
|
standard error goes. It is not possible to read the child's
|
|
standard error separately.
|
|
|
|
@cindex deadlocks
|
|
@cindex buffering, input/output
|
|
@cindex @code{getline} command, deadlock and
|
|
@item
|
|
I/O buffering may be a problem. @command{gawk} automatically
|
|
flushes all output down the pipe to the child process.
|
|
However, if the coprocess does not flush its output,
|
|
@command{gawk} may hang when doing a @code{getline} in order to read
|
|
the coprocess's results. This could lead to a situation
|
|
known as @dfn{deadlock}, where each process is waiting for the
|
|
other one to do something.
|
|
@end itemize
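
The buffering caution can be illustrated with a coprocess that flushes
its output after every record.  The following is a minimal sketch, not
the only way to arrange this: it assumes the program is saved in a file
(so that the single quotes need no extra shell quoting), and it uses a
second @command{gawk}, which calls @code{fflush}, as a purely
illustrative coprocess.  Because the child flushes each reply, the
parent can alternate writes and reads without waiting forever in
@code{getline}:

@example
BEGIN @{
    # illustrative coprocess: upcase each line, flush after each one
    cmd = "gawk '@{ print toupper($0); fflush() @}'"

    n = split("alpha beta gamma", words, " ")
    for (i = 1; i <= n; i++) @{
        print words[i] |& cmd            # send one line
        if ((cmd |& getline reply) > 0)  # read the matching reply
            print "reply:", reply
    @}
    close(cmd)
@}
@end example

@noindent
If the call to @code{fflush} is removed from the coprocess, its output
may stay in a buffer and the parent's first @code{getline} may never
return.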
|
|
|
|
@cindex @code{close} function, two-way pipes and
|
|
It is possible to close just one end of the two-way pipe to
|
|
a coprocess, by supplying a second argument to the @code{close}
|
|
function of either @code{"to"} or @code{"from"}
|
|
(@pxref{Close Files And Pipes}).
|
|
These strings tell @command{gawk} to close the end of the pipe
|
|
that sends data to the process or the end that reads from it,
|
|
respectively.
|
|
|
|
@cindex @command{sort} utility, coprocesses and
|
|
This is particularly necessary in order to use
|
|
the system @command{sort} utility as part of a coprocess;
|
|
@command{sort} must read @emph{all} of its input
|
|
data before it can produce any output.
|
|
The @command{sort} program does not receive an end-of-file indication
|
|
until @command{gawk} closes the write end of the pipe.
|
|
|
|
When you have finished writing data to the @command{sort}
|
|
utility, you can close the @code{"to"} end of the pipe, and
|
|
then start reading sorted data via @code{getline}.
|
|
For example:
|
|
|
|
@example
|
|
BEGIN @{
|
|
command = "LC_ALL=C sort"
|
|
n = split("abcdefghijklmnopqrstuvwxyz", a, "")
|
|
|
|
for (i = n; i > 0; i--)
|
|
print a[i] |& command
|
|
close(command, "to")
|
|
|
|
while ((command |& getline line) > 0)
|
|
print "got", line
|
|
close(command)
|
|
@}
|
|
@end example
|
|
|
|
This program writes the letters of the alphabet in reverse order, one
|
|
per line, down the two-way pipe to @command{sort}. It then closes the
|
|
write end of the pipe, so that @command{sort} receives an end-of-file
|
|
indication. This causes @command{sort} to sort the data and write the
|
|
sorted data back to the @command{gawk} program. Once all of the data
|
|
has been read, @command{gawk} terminates the coprocess and exits.
|
|
|
|
As a side note, the assignment @samp{LC_ALL=C} in the @command{sort}
|
|
command ensures traditional Unix (ASCII) sorting from @command{sort}.
|
|
|
|
Beginning with @command{gawk} 3.1.2, you may use pseudo-ttys (ptys) for
|
|
two-way communication instead of pipes, if your system supports them.
|
|
This is done on a per-command basis, by setting a special element
|
|
in the @code{PROCINFO} array
|
|
(@pxref{Auto-set}),
|
|
like so:
|
|
|
|
@example
|
|
command = "sort -nr" # command, saved in variable for convenience
|
|
PROCINFO[command, "pty"] = 1 # update PROCINFO
|
|
print @dots{} |& command # start two-way pipe
|
|
@dots{}
|
|
@end example
|
|
|
|
@noindent
|
|
Using ptys avoids the buffer deadlock issues described earlier, at some
|
|
loss in performance. If your system does not have ptys, or if all the
|
|
system's ptys are in use, @command{gawk} automatically falls back to
|
|
using regular pipes.
|
|
|
|
@node TCP/IP Networking
|
|
@section Using @command{gawk} for Network Programming
|
|
@cindex advanced features, @command{gawk}, network programming
|
|
@cindex networks, programming
|
|
@c STARTOFRANGE tcpip
|
|
@cindex TCP/IP
|
|
@cindex @code{/inet/} files (@command{gawk})
|
|
@cindex files, @code{/inet/} (@command{gawk})
|
|
@cindex @code{EMISTERED}
|
|
@quotation
|
|
@code{EMISTERED}: @i{A host is a host from coast to coast,@*
|
|
and no-one can talk to host that's close,@*
|
|
unless the host that isn't close@*
|
|
is busy hung or dead.}
|
|
@end quotation
|
|
|
|
In addition to being able to open a two-way pipeline to a coprocess
|
|
on the same system
|
|
(@pxref{Two-way I/O}),
|
|
it is possible to make a two-way connection to
|
|
another process on another system across an IP networking connection.
|
|
|
|
You can think of this as just a @emph{very long} two-way pipeline to
|
|
a coprocess.
|
|
The way @command{gawk} decides that you want to use TCP/IP networking is
|
|
by recognizing special @value{FN}s that begin with @samp{/inet/}.
|
|
|
|
The full syntax of the special @value{FN} is
|
|
@file{/inet/@var{protocol}/@var{local-port}/@var{remote-host}/@var{remote-port}}.
|
|
The components are:
|
|
|
|
@table @var
|
|
@item protocol
|
|
The protocol to use over IP.  This must be one of @samp{tcp},
|
|
@samp{udp}, or @samp{raw}, for a TCP, UDP, or raw IP connection,
|
|
respectively. The use of TCP is recommended for most applications.
|
|
|
|
@cindex raw sockets
|
|
@cindex sockets
|
|
@strong{Caution:} The use of raw sockets is not currently supported
|
|
in @value{PVERSION} 3.1 of @command{gawk}.
|
|
|
|
@item local-port
|
|
@cindex @code{getservbyname} function (C library)
|
|
The local TCP or UDP port number to use. Use a port number of @samp{0}
|
|
when you want the system to pick a port. This is what you should do
|
|
when writing a TCP or UDP client.
|
|
You may also use a well-known service name, such as @samp{smtp}
|
|
or @samp{http}, in which case @command{gawk} attempts to determine
|
|
the predefined port number using the C @code{getservbyname} function.
|
|
|
|
@item remote-host
|
|
The IP address or fully-qualified domain name of the Internet
|
|
host to which you want to connect.
|
|
|
|
@item remote-port
|
|
The TCP or UDP port number to use on the given @var{remote-host}.
|
|
Again, use @samp{0} if you don't care, or else a well-known
|
|
service name.
|
|
@end table
|
|
|
|
Consider the following very simple example:
|
|
|
|
@example
|
|
BEGIN @{
|
|
Service = "/inet/tcp/0/localhost/daytime"
|
|
Service |& getline
|
|
print $0
|
|
close(Service)
|
|
@}
|
|
@end example
|
|
|
|
This program reads the current date and time from the local system's
|
|
TCP @samp{daytime} server.
|
|
It then prints the results and closes the connection.
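
A client that must send a request before reading the reply follows the
same pattern.  The sketch below is an assumption-laden illustration, not
a complete HTTP client: @samp{www.example.com} is a placeholder host
name, and the request is the simplest HTTP/1.0 form.  It writes the
request with @code{printf} and then reads until the server closes the
connection:

@example
BEGIN @{
    Host = "www.example.com"      # placeholder; use a host you can reach
    Server = "/inet/tcp/0/" Host "/http"

    # HTTP request lines end in CR-LF; a blank line ends the request
    printf "GET / HTTP/1.0\r\nHost: %s\r\n\r\n", Host |& Server

    while ((Server |& getline line) > 0) @{
        sub(/\r$/, "", line)      # strip the trailing carriage return
        print line
    @}
    close(Server)
@}
@end example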
|
|
|
|
Because this topic is extensive, the use of @command{gawk} for
|
|
TCP/IP programming is documented separately.
|
|
@ifinfo
|
|
@xref{Top},
|
|
@end ifinfo
|
|
@ifnotinfo
|
|
See @cite{TCP/IP Internetworking with @command{gawk}},
|
|
which comes as part of the @command{gawk} distribution,
|
|
@end ifnotinfo
|
|
for a much more complete introduction and discussion, as well as
|
|
extensive examples.
|
|
|
|
@node Portal Files
|
|
@section Using @command{gawk} with BSD Portals
|
|
@cindex advanced features, @command{gawk}, BSD portals
|
|
@cindex portal files
|
|
@cindex files, portal
|
|
@cindex BSD portals
|
|
@cindex @code{/p} files (@command{gawk})
|
|
@cindex files, @code{/p} (@command{gawk})
|
|
@cindex @code{--enable-portals} configuration option
|
|
@cindex operating systems, BSD-based
|
|
|
|
Similar to the @file{/inet} special files, if @command{gawk}
|
|
is configured with the @option{--enable-portals} option
|
|
(@pxref{Quick Installation}),
|
|
then @command{gawk} treats
|
|
files whose pathnames begin with @code{/p} as 4.4 BSD-style portals.
|
|
|
|
@cindex @code{|} (vertical bar), @code{|&} operator (I/O), two-way communications
|
|
@cindex vertical bar (@code{|}), @code{|&} operator (I/O), two-way communications
|
|
When used with the @samp{|&} operator, @command{gawk} opens the file
|
|
for two-way communications. The operating system's portal mechanism
|
|
then manages creating the process associated with the portal and
|
|
the corresponding communications with the portal's process.
|
|
@c ENDOFRANGE tcpip
|
|
|
|
@node Profiling
|
|
@section Profiling Your @command{awk} Programs
|
|
@c STARTOFRANGE awkp
|
|
@cindex @command{awk} programs, profiling
|
|
@c STARTOFRANGE proawk
|
|
@cindex profiling @command{awk} programs
|
|
@c STARTOFRANGE pgawk
|
|
@cindex @command{pgawk} program
|
|
@cindex profiling @command{gawk}, See @command{pgawk} program
|
|
|
|
Beginning with @value{PVERSION} 3.1 of @command{gawk}, you may produce execution
|
|
traces of your @command{awk} programs.
|
|
This is done with a specially compiled version of @command{gawk},
|
|
called @command{pgawk} (``profiling @command{gawk}'').
|
|
|
|
@cindex @code{awkprof.out} file
|
|
@cindex files, @code{awkprof.out}
|
|
@cindex @command{pgawk} program, @code{awkprof.out} file
|
|
@command{pgawk} is identical in every way to @command{gawk}, except that when
|
|
it has finished running, it creates a profile of your program in a file
|
|
named @file{awkprof.out}.
|
|
Because it is profiling, it also executes up to 45% slower than
|
|
@command{gawk} normally does.
|
|
|
|
@cindex @code{--profile} option
|
|
As shown in the following example,
|
|
the @option{--profile} option can be used to change the name of the file
|
|
where @command{pgawk} will write the profile:
|
|
|
|
@example
|
|
$ pgawk --profile=myprog.prof -f myprog.awk data1 data2
|
|
@end example
|
|
|
|
@noindent
|
|
In the above example, @command{pgawk} places the profile in
|
|
@file{myprog.prof} instead of in @file{awkprof.out}.
|
|
|
|
Regular @command{gawk} also accepts this option. When called with just
|
|
@option{--profile}, @command{gawk} ``pretty prints'' the program into
|
|
@file{awkprof.out}, without any execution counts. You may supply an
|
|
argument to @option{--profile} to change the @value{FN}. Here is a sample
|
|
session showing a simple @command{awk} program, its input data, and the
|
|
results from running @command{pgawk}. First, the @command{awk} program:
|
|
|
|
@example
|
|
BEGIN @{ print "First BEGIN rule" @}
|
|
|
|
END @{ print "First END rule" @}
|
|
|
|
/foo/ @{
|
|
print "matched /foo/, gosh"
|
|
for (i = 1; i <= 3; i++)
|
|
sing()
|
|
@}
|
|
|
|
@{
|
|
if (/foo/)
|
|
print "if is true"
|
|
else
|
|
print "else is true"
|
|
@}
|
|
|
|
BEGIN @{ print "Second BEGIN rule" @}
|
|
|
|
END @{ print "Second END rule" @}
|
|
|
|
function sing( dummy)
|
|
@{
|
|
print "I gotta be me!"
|
|
@}
|
|
@end example
|
|
|
|
Following is the input data:
|
|
|
|
@example
|
|
foo
|
|
bar
|
|
baz
|
|
foo
|
|
junk
|
|
@end example
|
|
|
|
Here is the @file{awkprof.out} that results from running @command{pgawk}
|
|
on this program and data (this example also illustrates that @command{awk}
|
|
programmers sometimes have to work late):
|
|
|
|
@cindex @code{BEGIN} pattern, @command{pgawk} program
|
|
@cindex @code{END} pattern, @command{pgawk} program
|
|
@example
|
|
# gawk profile, created Sun Aug 13 00:00:15 2000
|
|
|
|
# BEGIN block(s)
|
|
|
|
BEGIN @{
|
|
1 print "First BEGIN rule"
|
|
1 print "Second BEGIN rule"
|
|
@}
|
|
|
|
# Rule(s)
|
|
|
|
5 /foo/ @{ # 2
|
|
2 print "matched /foo/, gosh"
|
|
6 for (i = 1; i <= 3; i++) @{
|
|
6 sing()
|
|
@}
|
|
@}
|
|
|
|
5 @{
|
|
5 if (/foo/) @{ # 2
|
|
2 print "if is true"
|
|
3 @} else @{
|
|
3 print "else is true"
|
|
@}
|
|
@}
|
|
|
|
# END block(s)
|
|
|
|
END @{
|
|
1 print "First END rule"
|
|
1 print "Second END rule"
|
|
@}
|
|
|
|
# Functions, listed alphabetically
|
|
|
|
6 function sing(dummy)
|
|
@{
|
|
6 print "I gotta be me!"
|
|
@}
|
|
@end example
|
|
|
|
This example illustrates many of the basic rules for profiling output.
|
|
The rules are as follows:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
The program is printed in the order @code{BEGIN} rule,
|
|
pattern/action rules, @code{END} rule and functions, listed
|
|
alphabetically.
|
|
Multiple @code{BEGIN} and @code{END} rules are merged together.
|
|
|
|
@cindex patterns, counts
|
|
@item
|
|
Pattern/action rules have two counts.
|
|
The first count, to the left of the rule, shows how many times
|
|
the rule's pattern was @emph{tested}.
|
|
The second count, to the right of the rule's opening left brace
|
|
in a comment,
|
|
shows how many times the rule's action was @emph{executed}.
|
|
The difference between the two indicates how many times the rule's
|
|
pattern evaluated to false.
|
|
|
|
@item
|
|
Similarly,
|
|
the count for an @code{if}-@code{else} statement shows how many times
|
|
the condition was tested.
|
|
To the right of the opening left brace for the @code{if}'s body
|
|
is a count showing how many times the condition was true.
|
|
The count for the @code{else}
|
|
indicates how many times the test failed.
|
|
|
|
@cindex loops, count for header
|
|
@item
|
|
The count for a loop header (such as @code{for}
|
|
or @code{while}) shows how many times the loop test was executed.
|
|
(Because of this, you can't just look at the count on the first
|
|
statement in a rule to determine how many times the rule was executed.
|
|
If the first statement is a loop, the count is misleading.)
|
|
|
|
@cindex functions, user-defined, counts
|
|
@cindex user-defined, functions, counts
|
|
@item
|
|
For user-defined functions, the count next to the @code{function}
|
|
keyword indicates how many times the function was called.
|
|
The counts next to the statements in the body show how many times
|
|
those statements were executed.
|
|
|
|
@cindex @code{@{@}} (braces), @command{pgawk} program
|
|
@cindex braces (@code{@{@}}), @command{pgawk} program
|
|
@item
|
|
The layout uses ``K&R'' style with tabs.
|
|
Braces are used everywhere, even when
|
|
the body of an @code{if}, @code{else}, or loop is only a single statement.
|
|
|
|
@cindex @code{()} (parentheses), @command{pgawk} program
|
|
@cindex parentheses @code{()}, @command{pgawk} program
|
|
@item
|
|
Parentheses are used only where needed, as indicated by the structure
|
|
of the program and the precedence rules.
|
|
@c extra verbiage here satisfies the copyeditor. ugh.
|
|
For example, @samp{(3 + 5) * 4} means add three and five, then multiply
|
|
the total by four. However, @samp{3 + 5 * 4} has no parentheses, and
|
|
means @samp{3 + (5 * 4)}.
|
|
|
|
@item
|
|
All string concatenations are parenthesized too.
|
|
(This could be made a bit smarter.)
|
|
|
|
@item
|
|
Parentheses are used around the arguments to @code{print}
|
|
and @code{printf} only when
|
|
the @code{print} or @code{printf} statement is followed by a redirection.
|
|
Similarly, if
|
|
the target of a redirection isn't a scalar, it gets parenthesized.
|
|
|
|
@item
|
|
@command{pgawk} supplies leading comments in
|
|
front of the @code{BEGIN} and @code{END} rules,
|
|
the pattern/action rules, and the functions.
|
|
|
|
@end itemize
|
|
|
|
The profiled version of your program may not look exactly like what you
|
|
typed when you wrote it. This is because @command{pgawk} creates the
|
|
profiled version by ``pretty printing'' its internal representation of
|
|
the program. The advantage to this is that @command{pgawk} can produce
|
|
a standard representation. The disadvantage is that all source-code
|
|
comments are lost, as are the distinctions among multiple @code{BEGIN}
|
|
and @code{END} rules. Also, things such as:
|
|
|
|
@example
|
|
/foo/
|
|
@end example
|
|
|
|
@noindent
|
|
come out as:
|
|
|
|
@example
|
|
/foo/ @{
|
|
print $0
|
|
@}
|
|
@end example
|
|
|
|
@noindent
|
|
which is correct, but possibly surprising.
|
|
|
|
@cindex profiling @command{awk} programs, dynamically
|
|
@cindex @command{pgawk} program, dynamic profiling
|
|
Besides creating profiles when a program has completed,
|
|
@command{pgawk} can produce a profile while it is running.
|
|
This is useful if your @command{awk} program goes into an
|
|
infinite loop and you want to see what has been executed.
|
|
To use this feature, run @command{pgawk} in the background:
|
|
|
|
@example
|
|
$ pgawk -f myprog &
|
|
[1] 13992
|
|
@end example
|
|
|
|
@c comma does NOT start secondary
|
|
@cindex @command{kill} command, dynamic profiling
|
|
@cindex @code{USR1} signal
|
|
@cindex signals, @code{USR1}/@code{SIGUSR1}
|
|
@noindent
|
|
The shell prints a job number and process ID number; in this case, 13992.
|
|
Use the @command{kill} command to send the @code{USR1} signal
|
|
to @command{pgawk}:
|
|
|
|
@example
|
|
$ kill -USR1 13992
|
|
@end example
|
|
|
|
@noindent
|
|
As usual, the profiled version of the program is written to
|
|
@file{awkprof.out}, or to a different file if you use the @option{--profile}
|
|
option.
|
|
|
|
Along with the regular profile, as shown earlier, the profile
|
|
includes a trace of any active functions:
|
|
|
|
@example
|
|
# Function Call Stack:
|
|
|
|
# 3. baz
|
|
# 2. bar
|
|
# 1. foo
|
|
# -- main --
|
|
@end example
|
|
|
|
You may send @command{pgawk} the @code{USR1} signal as many times as you like.
|
|
Each time, the profile and function call trace are appended to the output
|
|
profile file.
|
|
|
|
@cindex @code{HUP} signal
|
|
@cindex signals, @code{HUP}/@code{SIGHUP}
|
|
If you use the @code{HUP} signal instead of the @code{USR1} signal,
|
|
@command{pgawk} produces the profile and the function call trace and then exits.
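
For example, the following sketch requests one intermediate profile and
then a final one.  It assumes a POSIX-style shell, where the @code{$!}
parameter holds the process ID of the most recent background job, and
the same hypothetical program file @file{myprog} used earlier:

@example
$ pgawk -f myprog &
$ kill -USR1 $!
$ kill -HUP $!
@end example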
|
|
|
|
@cindex @code{INT} signal (MS-DOS)
|
|
@cindex signals, @code{INT}/@code{SIGINT} (MS-DOS)
|
|
@cindex @code{QUIT} signal (MS-DOS)
|
|
@cindex signals, @code{QUIT}/@code{SIGQUIT} (MS-DOS)
|
|
When @command{pgawk} runs on MS-DOS or MS-Windows, it produces the
profile upon receiving either the @code{INT} or the @code{QUIT} signal;
after an @code{INT} signal, @command{pgawk} also exits.  This is
|
|
because these systems don't support the @command{kill} command, so the
|
|
only signals you can deliver to a program are those generated by the
|
|
keyboard. The @code{INT} signal is generated by the
|
|
@kbd{@value{CTL}-@key{C}} or @kbd{@value{CTL}-@key{BREAK}} key, while the
|
|
@code{QUIT} signal is generated by the @kbd{@value{CTL}-@key{\}} key.
|
|
@c ENDOFRANGE advgaw
|
|
@c ENDOFRANGE gawadv
|
|
@c ENDOFRANGE pgawk
|
|
@c ENDOFRANGE awkp
|
|
@c ENDOFRANGE proawk
|
|
|
|
@node Invoking Gawk
|
|
@chapter Running @command{awk} and @command{gawk}
|
|
|
|
This @value{CHAPTER} covers how to run @command{awk}, both POSIX-standard
|
|
and @command{gawk}-specific command-line options, and what
|
|
@command{awk} and
|
|
@command{gawk} do with non-option arguments.
|
|
It then proceeds to cover how @command{gawk} searches for source files,
|
|
obsolete options and/or features, and known bugs in @command{gawk}.
|
|
This @value{CHAPTER} rounds out the discussion of @command{awk}
|
|
as a program and as a language.
|
|
|
|
While a number of the options and features described here were
|
|
discussed in passing earlier in the book, this @value{CHAPTER} provides the
|
|
full details.
|
|
|
|
@menu
* Command Line::                How to run @command{awk}.
* Options::                     Command-line options and their meanings.
* Other Arguments::             Input file names and variable assignments.
* AWKPATH Variable::            Searching directories for @command{awk}
                                programs.
* Obsolete::                    Obsolete Options and/or features.
* Undocumented::                Undocumented Options and Features.
* Known Bugs::                  Known Bugs in @command{gawk}.
@end menu
|
|
|
|
@node Command Line
|
|
@section Invoking @command{awk}
|
|
@cindex command line, invoking @command{awk} from
|
|
@cindex @command{awk}, invoking
|
|
@cindex arguments, command-line, invoking @command{awk}
|
|
@cindex options, command-line, invoking @command{awk}
|
|
|
|
There are two ways to run @command{awk}---with an explicit program or with
|
|
one or more program files. Here are templates for both of them; items
|
|
enclosed in [@dots{}] in these templates are optional:
|
|
|
|
@example
awk @r{[@var{options}]} -f progfile @r{[@code{--}]} @var{file} @dots{}
awk @r{[@var{options}]} @r{[@code{--}]} '@var{program}' @var{file} @dots{}
@end example
|
|
|
|
@cindex GNU long options
|
|
@cindex long options
|
|
@cindex options, long
|
|
Besides traditional one-letter POSIX-style options, @command{gawk} also
|
|
supports GNU long options.
|
|
|
|
@cindex dark corner, invoking @command{awk}
|
|
@cindex lint checking, empty programs
|
|
It is possible to invoke @command{awk} with an empty program:
|
|
|
|
@example
awk '' datafile1 datafile2
@end example
|
|
|
|
@cindex @code{--lint} option
|
|
@noindent
|
|
Doing so makes little sense, though; @command{awk} exits
|
|
silently when given an empty program.
|
|
@value{DARKCORNER}
|
|
If @option{--lint} has
|
|
been specified on the command line, @command{gawk} issues a
|
|
warning that the program is empty.
|
|
|
|
@node Options
|
|
@section Command-Line Options
|
|
@c STARTOFRANGE ocl
|
|
@cindex options, command-line
|
|
@c STARTOFRANGE clo
|
|
@cindex command line, options
|
|
@c STARTOFRANGE gnulo
|
|
@cindex GNU long options
|
|
@c STARTOFRANGE longo
|
|
@cindex options, long
|
|
|
|
Options begin with a dash and consist of a single character.
|
|
GNU-style long options consist of two dashes and a keyword.
|
|
The keyword can be abbreviated, as long as the abbreviation allows the option
|
|
to be uniquely identified. If the option takes an argument, then the
|
|
keyword is either immediately followed by an equals sign (@samp{=}) and the
|
|
argument's value, or the keyword and the argument's value are separated
|
|
by whitespace.
|
|
If a particular option with a value is given more than once, it is the
|
|
last value that counts.
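
For example, the following three command lines are intended to be
equivalent ways of setting the field separator to a colon with a long
option. This is only a sketch: it assumes that @samp{--field-sep} is an
unambiguous abbreviation in your version of @command{gawk}, and it uses
@file{/etc/passwd} merely as a convenient colon-separated file:

@example
gawk --field-separator=: '@{ print $1 @}' /etc/passwd
gawk --field-separator : '@{ print $1 @}' /etc/passwd
gawk --field-sep=: '@{ print $1 @}' /etc/passwd
@end example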
|
|
|
|
@cindex POSIX @command{awk}, GNU long options and
|
|
Each long option for @command{gawk} has a corresponding
|
|
POSIX-style option.
|
|
The long and short options are
|
|
interchangeable in all contexts.
|
|
The options and their meanings are as follows:
|
|
|
|
@table @code
|
|
@item -F @var{fs}
|
|
@itemx --field-separator @var{fs}
|
|
@cindex @code{-F} option
|
|
@cindex @code{--field-separator} option
|
|
@cindex @code{FS} variable, @code{--field-separator} option and
|
|
Sets the @code{FS} variable to @var{fs}
|
|
(@pxref{Field Separators}).
|
|
|
|
@item -f @var{source-file}
|
|
@itemx --file @var{source-file}
|
|
@cindex @code{-f} option
|
|
@cindex @code{--file} option
|
|
@cindex @command{awk} programs, location of
|
|
Indicates that the @command{awk} program is to be found in @var{source-file}
|
|
instead of in the first non-option argument.
|
|
|
|
@item -v @var{var}=@var{val}
|
|
@itemx --assign @var{var}=@var{val}
|
|
@cindex @code{-v} option
|
|
@cindex @code{--assign} option
|
|
@cindex variables, setting
|
|
Sets the variable @var{var} to the value @var{val} @emph{before}
|
|
execution of the program begins. Such variable values are available
|
|
inside the @code{BEGIN} rule
|
|
(@pxref{Other Arguments}).
|
|
|
|
The @option{-v} option can only set one variable, but it can be used
|
|
more than once, setting another variable each time, like this:
|
|
@samp{awk @w{-v foo=1} @w{-v bar=2} @dots{}}.
|
|
|
|
@c last comma is part of secondary
|
|
@cindex built-in variables, @code{-v} option, setting with
|
|
@c last comma is part of tertiary
|
|
@cindex variables, built-in, @code{-v} option, setting with
|
|
@strong{Caution:} Using @option{-v} to set the values of the built-in
|
|
variables may lead to surprising results. @command{awk} will reset the
|
|
values of those variables as it needs to, possibly ignoring any
|
|
predefined value you may have given.
|
|
|
|
@item -mf @var{N}
|
|
@itemx -mr @var{N}
|
|
@cindex @code{-mf}/@code{-mr} options
|
|
@cindex memory, setting limits
|
|
Sets various memory limits to the value @var{N}. The @samp{f} flag sets
|
|
the maximum number of fields and the @samp{r} flag sets the maximum
|
|
record size. These two flags and the @option{-m} option are from the
|
|
Bell Laboratories research version of Unix @command{awk}. They are provided
|
|
for compatibility but otherwise ignored by
|
|
@command{gawk}, since @command{gawk} has no predefined limits.
|
|
(The Bell Laboratories @command{awk} no longer needs these options;
|
|
it continues to accept them to avoid breaking old programs.)
|
|
|
|
@item -W @var{gawk-opt}
|
|
@cindex @code{-W} option
|
|
Following the POSIX standard, implementation-specific
|
|
options are supplied as arguments to the @option{-W} option. These options
|
|
also have corresponding GNU-style long options.
|
|
Note that the long options may be abbreviated, as long as
|
|
the abbreviations remain unique.
|
|
The full list of @command{gawk}-specific options is provided next.
|
|
|
|
@item --
|
|
@cindex command line, options, end of
|
|
@cindex options, command-line, end of
|
|
Signals the end of the command-line options. The following arguments
|
|
are not treated as options even if they begin with @samp{-}. This
|
|
interpretation of @option{--} follows the POSIX argument parsing
|
|
conventions.
|
|
|
|
@cindex @code{-} (hyphen), filenames beginning with
|
|
@cindex hyphen (@code{-}), filenames beginning with
|
|
This is useful if you have @value{FN}s that start with @samp{-},
|
|
or in shell scripts, if you have @value{FN}s that will be specified
|
|
by the user that could start with @samp{-}.
|
|
@end table
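
As a brief illustration of two of these options (a sketch only; the
variable name @code{greeting} and the @value{FN} @file{-notes} are made
up for the example), @option{-v} makes a value visible inside a
@code{BEGIN} rule, and @option{--} lets a @value{FN} that begins with
@samp{-} be treated as data rather than as an option:

@example
$ awk -v greeting=hello 'BEGIN @{ print greeting @}'
@print{} hello
$ awk -- '@{ print NR ": " $0 @}' -notes
@end example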
|
|
@c ENDOFRANGE gnulo
|
|
@c ENDOFRANGE longo
|
|
|
|
The previous list described options mandated by the POSIX standard,
|
|
as well as options available in the Bell Laboratories version of @command{awk}.
|
|
The following list describes @command{gawk}-specific options:
|
|
|
|
@table @code
|
|
@item -W compat
|
|
@itemx -W traditional
|
|
@itemx --compat
|
|
@itemx --traditional
|
|
@cindex @code{--compat} option
|
|
@cindex @code{--traditional} option
|
|
@cindex compatibility mode (@command{gawk}), specifying
|
|
Specifies @dfn{compatibility mode}, in which the GNU extensions to
|
|
the @command{awk} language are disabled, so that @command{gawk} behaves just
|
|
like the Bell Laboratories research version of Unix @command{awk}.
|
|
@option{--traditional} is the preferred form of this option.
|
|
@xref{POSIX/GNU},
|
|
which summarizes the extensions. Also see
|
|
@ref{Compatibility Mode}.
|
|
|
|
@item -W copyright
|
|
@itemx --copyright
|
|
@cindex @code{--copyright} option
|
|
@cindex GPL (General Public License), printing
|
|
Prints the short version of the General Public License and then exits.
|
|
|
|
@item -W copyleft
|
|
@itemx --copyleft
|
|
@cindex @code{--copyleft} option
|
|
Just like @option{--copyright}.
|
|
This option may disappear in a future version of @command{gawk}.
|
|
|
|
@cindex @code{--dump-variables} option
|
|
@cindex @code{awkvars.out} file
|
|
@cindex files, @code{awkvars.out}
|
|
@cindex variables, global, printing list of
|
|
@item -W dump-variables@r{[}=@var{file}@r{]}
|
|
@itemx --dump-variables@r{[}=@var{file}@r{]}
|
|
Prints a sorted list of global variables, their types, and final values
|
|
to @var{file}. If no @var{file} is provided, @command{gawk} prints this
|
|
list to the file named @file{awkvars.out} in the current directory.
|
|
|
|
@c last comma is part of secondary
|
|
@cindex troubleshooting, typographical errors, global variables
|
|
Having a list of all global variables is a good way to look for
|
|
typographical errors in your programs.
|
|
You would also use this option if you have a large program with a lot of
|
|
functions, and you want to be sure that your functions don't
|
|
inadvertently use global variables that you meant to be local.
|
|
(This is a particularly easy mistake to make with simple variable
|
|
names like @code{i}, @code{j}, etc.)
|
|
|
|
@item -W gen-po
|
|
@itemx --gen-po
|
|
@cindex @code{--gen-po} option
|
|
@cindex portable object files, generating
|
|
@cindex files, portable object, generating
|
|
Analyzes the source program and
|
|
generates a GNU @code{gettext} Portable Object file on standard
|
|
output for all string constants that have been marked for translation.
|
|
@xref{Internationalization},
|
|
for information about this option.
|
|
|
|
@item -W help
|
|
@itemx -W usage
|
|
@itemx --help
|
|
@itemx --usage
|
|
@cindex @code{--help} option
|
|
@cindex @code{--usage} option
|
|
@cindex GNU long options, printing list of
|
|
@cindex options, printing list of
|
|
@cindex printing, list of options
|
|
Prints a ``usage'' message summarizing the short and long style options
|
|
that @command{gawk} accepts and then exits.
|
|
|
|
@item -W lint@r{[}=fatal@r{]}
|
|
@itemx --lint@r{[}=fatal@r{]}
|
|
@cindex @code{--lint} option
|
|
@cindex lint checking, issuing warnings
|
|
@cindex warnings, issuing
|
|
Warns about constructs that are dubious or nonportable to
|
|
other @command{awk} implementations.
|
|
Some warnings are issued when @command{gawk} first reads your program. Others
|
|
are issued at runtime, as your program executes.
|
|
With an optional argument of @samp{fatal},
|
|
lint warnings become fatal errors.
|
|
This may be drastic, but its use will certainly encourage the
|
|
development of cleaner @command{awk} programs.
|
|
With an optional argument of @samp{invalid}, only warnings about things that are
|
|
actually invalid are issued. (This is not fully implemented yet.)
|
|
|
|
@item -W lint-old
|
|
@itemx --lint-old
|
|
@cindex @code{--lint-old} option
|
|
Warns about constructs that are not available in the original version of
|
|
@command{awk} from Version 7 Unix
|
|
(@pxref{V7/SVR3.1}).
|
|
|
|
@item -W non-decimal-data
|
|
@itemx --non-decimal-data
|
|
@cindex @code{--non-decimal-data} option
|
|
@cindex hexadecimal, values, enabling interpretation of
|
|
@c comma is part of primary
|
|
@cindex octal values, enabling interpretation of
|
|
Enables automatic interpretation of octal and hexadecimal
|
|
values in input data
|
|
(@pxref{Nondecimal Data}).
|
|
|
|
@cindex troubleshooting, @code{--non-decimal-data} option
|
|
@strong{Caution:} This option can severely break old programs.
|
|
Use with care.
|
|
|
|
@item -W posix
|
|
@itemx --posix
|
|
@cindex @code{--posix} option
|
|
@cindex POSIX mode
|
|
@c last comma is part of tertiary
|
|
@cindex @command{gawk}, extensions, disabling
|
|
Operates in strict POSIX mode. This disables all @command{gawk}
|
|
extensions (just like @option{--traditional}) and adds the following additional
|
|
restrictions:
|
|
|
|
@c IMPORTANT! Keep this list in sync with the one in node POSIX
|
|
|
|
@itemize @bullet
|
|
@cindex escape sequences, unrecognized
|
|
@item
|
|
@code{\x} escape sequences are not recognized
|
|
(@pxref{Escape Sequences}).
|
|
|
|
@cindex newlines
|
|
@cindex whitespace, newlines as
|
|
@item
|
|
Newlines do not act as whitespace to separate fields when @code{FS} is
|
|
equal to a single space
|
|
(@pxref{Fields}).
|
|
|
|
@item
|
|
Newlines are not allowed after @samp{?} or @samp{:}
|
|
(@pxref{Conditional Exp}).
|
|
|
|
@item
|
|
The synonym @code{func} for the keyword @code{function} is not
|
|
recognized (@pxref{Definition Syntax}).
|
|
|
|
@cindex @code{*} (asterisk), @code{**} operator
|
|
@cindex asterisk (@code{*}), @code{**} operator
|
|
@cindex @code{*} (asterisk), @code{**=} operator
|
|
@cindex asterisk (@code{*}), @code{**=} operator
|
|
@cindex @code{^} (caret), @code{^} operator
|
|
@cindex caret (@code{^}), @code{^} operator
|
|
@cindex @code{^} (caret), @code{^=} operator
|
|
@cindex caret (@code{^}), @code{^=} operator
|
|
@item
|
|
The @samp{**} and @samp{**=} operators cannot be used in
|
|
place of @samp{^} and @samp{^=} (@pxref{Arithmetic Ops},
|
|
and also @pxref{Assignment Ops}).
|
|
|
|
@cindex @code{FS} variable, as TAB character
|
|
@item
|
|
Specifying @samp{-Ft} on the command line does not set the value
|
|
of @code{FS} to be a single TAB character
|
|
(@pxref{Field Separators}).
|
|
|
|
@c comma does not start secondary
|
|
@cindex @code{fflush} function, unsupported
|
|
@item
|
|
The @code{fflush} built-in function is not supported
|
|
(@pxref{I/O Functions}).
|
|
@end itemize
|
|
|
|
@c @cindex automatic warnings
|
|
@c @cindex warnings, automatic
|
|
@cindex @code{--traditional} option, @code{--posix} option and
|
|
@cindex @code{--posix} option, @code{--traditional} option and
|
|
If you supply both @option{--traditional} and @option{--posix} on the
|
|
command line, @option{--posix} takes precedence. @command{gawk}
|
|
also issues a warning if both options are supplied.
|
|
|
|
@item -W profile@r{[}=@var{file}@r{]}
|
|
@itemx --profile@r{[}=@var{file}@r{]}
|
|
@cindex @code{--profile} option
|
|
@cindex @command{awk} programs, profiling, enabling
|
|
Enables profiling of @command{awk} programs
|
|
(@pxref{Profiling}).
|
|
By default, profiles are created in a file named @file{awkprof.out}.
|
|
The optional @var{file} argument allows you to specify a different
|
|
@value{FN} for the profile file.
|
|
|
|
When run with @command{gawk}, the profile is just a ``pretty printed'' version
|
|
of the program. When run with @command{pgawk}, the profile contains execution
|
|
counts for each statement in the program in the left margin, and function
|
|
call counts for each function.
|
|
|
|
@item -W re-interval
|
|
@itemx --re-interval
|
|
@cindex @code{--re-interval} option
|
|
@cindex regular expressions, interval expressions and
|
|
Allows interval expressions
|
|
(@pxref{Regexp Operators})
|
|
in regexps.
|
|
Because interval expressions were traditionally not available in @command{awk},
|
|
@command{gawk} does not provide them by default. This prevents old @command{awk}
|
|
programs from breaking.
|
|
|
|
@item -W source @var{program-text}
|
|
@itemx --source @var{program-text}
|
|
@cindex @code{--source} option
|
|
@cindex source code, mixing
|
|
Allows you to mix source code in files with source
|
|
code that you enter on the command line.
|
|
Program source code is taken from the @var{program-text}.
|
|
This is particularly useful
|
|
when you have library functions that you want to use from your command-line
|
|
programs (@pxref{AWKPATH Variable}).
|
|
|
|
@item -W version
|
|
@itemx --version
|
|
@cindex @code{--version} option
|
|
@c last comma is part of tertiary
|
|
@cindex @command{gawk}, versions of, information about, printing
|
|
Prints version information for this particular copy of @command{gawk}.
|
|
This allows you to determine if your copy of @command{gawk} is up to date
|
|
with respect to whatever the Free Software Foundation is currently
|
|
distributing.
|
|
It is also useful for bug reports
|
|
(@pxref{Bugs}).
|
|
@end table
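
The following commands sketch how two of these options might be used.
The @value{FN} @file{vars.out} and the variable names are invented for
the illustration, and the exact layout of the variable dump may differ
between @command{gawk} versions, so no output is shown for it:

@example
$ gawk --dump-variables=vars.out 'BEGIN @{ total = 1; totl = 2 @}'
$ cat vars.out
$ echo 0x11 | gawk --non-decimal-data '@{ print $1 + 0 @}'
@print{} 17
@end example

@noindent
In the first command, the misspelled variable @code{totl} is listed in
@file{vars.out} alongside @code{total}, which makes the typographical
error easy to spot.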
|
|
|
|
As long as program text has been supplied,
|
|
any other options are flagged as invalid with a warning message but
|
|
are otherwise ignored.
|
|
|
|
@cindex @code{-F} option, @code{-Ft} sets @code{FS} to TAB
|
|
In compatibility mode, as a special case, if the value of @var{fs} supplied
|
|
to the @option{-F} option is @samp{t}, then @code{FS} is set to the TAB
|
|
character (@code{"\t"}). This is true only for @option{--traditional} and not
|
|
for @option{--posix}
|
|
(@pxref{Field Separators}).
|
|
|
|
@cindex @code{-f} option, on command line
|
|
The @option{-f} option may be used more than once on the command line.
|
|
If it is, @command{awk} reads its program source from all of the named files, as
|
|
if they had been concatenated together into one big file. This is
|
|
useful for creating libraries of @command{awk} functions. These functions
|
|
can be written once and then retrieved from a standard place, instead
|
|
of having to be included in each individual program.
|
|
(As mentioned in
|
|
@ref{Definition Syntax},
|
|
function names must be unique.)
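
For instance, here is a minimal sketch that assumes two hypothetical
files: @file{lib.awk}, which holds a function definition, and
@file{main.awk}, which holds the rule that calls it. The function name
@code{double} is likewise invented for the illustration:

@example
$ cat lib.awk
@print{} function double(x) @{ return 2 * x @}
$ cat main.awk
@print{} @{ print double($1) @}
$ echo 21 | awk -f lib.awk -f main.awk
@print{} 42
@end example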
|
|
|
|
Library functions can still be used, even if the program is entered at the terminal,
|
|
by specifying @samp{-f /dev/tty}. After typing your program,
|
|
type @kbd{@value{CTL}-d} (the end-of-file character) to terminate it.
|
|
(You may also use @samp{-f -} to read program source from the standard
|
|
input but then you will not be able to also use the standard input as a
|
|
source of data.)
|
|
|
|
Because it is clumsy to use the standard @command{awk} mechanisms to mix
source-file and command-line @command{awk} programs, @command{gawk} provides the
|
|
@option{--source} option. This does not require you to pre-empt the standard
|
|
input for your source code; it allows you to easily mix command-line
|
|
and library source code
|
|
(@pxref{AWKPATH Variable}).
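
As a sketch, assume a hypothetical library file @file{lib.awk} that
defines a function @code{double(x)} returning twice its argument. The
library can then be combined with a program typed directly on the
command line, and the standard input remains free for data:

@example
$ gawk -f lib.awk --source 'BEGIN @{ print double(21) @}'
@print{} 42
@end example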
|
|
|
|
@cindex @code{--source} option
|
|
If no @option{-f} or @option{--source} option is specified, then @command{gawk}
|
|
uses the first non-option command-line argument as the text of the
|
|
program source code.
|
|
|
|
@cindex @code{POSIXLY_CORRECT} environment variable
|
|
@cindex lint checking, @code{POSIXLY_CORRECT} environment variable
|
|
@cindex POSIX mode
|
|
If the environment variable @env{POSIXLY_CORRECT} exists,
|
|
then @command{gawk} behaves in strict POSIX mode, exactly as if
|
|
you had supplied the @option{--posix} command-line option.
|
|
Many GNU programs look for this environment variable to turn on
|
|
strict POSIX mode. If @option{--lint} is supplied on the command line
|
|
and @command{gawk} turns on POSIX mode because of @env{POSIXLY_CORRECT},
|
|
then it issues a warning message indicating that POSIX
|
|
mode is in effect.
|
|
You would typically set this variable in your shell's startup file.
|
|
For a Bourne-compatible shell (such as @command{bash}), you would add these
|
|
lines to the @file{.profile} file in your home directory:
|
|
|
|
@example
POSIXLY_CORRECT=true
export POSIXLY_CORRECT
@end example
|
|
|
|
@cindex @command{csh} utility, @code{POSIXLY_CORRECT} environment variable
|
|
For a @command{csh}-compatible
|
|
shell,@footnote{Not recommended.}
|
|
you would add this line to the @file{.login} file in your home directory:
|
|
|
|
@example
setenv POSIXLY_CORRECT true
@end example
|
|
|
|
@cindex portability, @code{POSIXLY_CORRECT} environment variable
|
|
Having @env{POSIXLY_CORRECT} set is not recommended for daily use,
|
|
but it is good for testing the portability of your programs to other
|
|
environments.
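
If you want POSIX mode for only a single test run, a Bourne-compatible
shell lets you set the variable for just one command instead of
exporting it from a startup file. This is a sketch;
@file{myprog.awk} and @file{datafile} are placeholder names:

@example
POSIXLY_CORRECT=true gawk --lint -f myprog.awk datafile
@end example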
|
|
@c ENDOFRANGE ocl
|
|
@c ENDOFRANGE clo
|
|
|
|
@node Other Arguments
|
|
@section Other Command-Line Arguments
|
|
@cindex command line, arguments
|
|
@cindex arguments, command-line
|
|
|
|
Any additional arguments on the command line are normally treated as
|
|
input files to be processed in the order specified. However, an
|
|
argument that has the form @code{@var{var}=@var{value}} assigns
|
|
the value @var{value} to the variable @var{var}---it does not specify a
|
|
file at all.
|
|
(This was discussed earlier in
|
|
@ref{Assignment Options}.)
|
|
|
|
@cindex @code{ARGIND} variable, command-line arguments
|
|
@cindex @code{ARGC}/@code{ARGV} variables, command-line arguments
|
|
All these arguments are made available to your @command{awk} program in the
|
|
@code{ARGV} array (@pxref{Built-in Variables}). Command-line options
|
|
and the program text (if present) are omitted from @code{ARGV}.
|
|
All other arguments, including variable assignments, are
|
|
included. As each element of @code{ARGV} is processed, @command{gawk}
|
|
sets the variable @code{ARGIND} to the index in @code{ARGV} of the
|
|
current element.
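
The following sketch prints the elements of @code{ARGV} that follow the
program text. The argument names are arbitrary; because the program
consists only of a @code{BEGIN} rule, @command{awk} never tries to open
@file{file1} or @file{file2}, so they need not exist:

@example
$ awk 'BEGIN @{ for (i = 1; i < ARGC; i++) print i, ARGV[i] @}' n=1 file1 file2
@print{} 1 n=1
@print{} 2 file1
@print{} 3 file2
@end example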
|
|
|
|
@cindex input files, variable assignments and
|
|
The distinction between @value{FN} arguments and variable-assignment
|
|
arguments is made when @command{awk} is about to open the next input file.
|
|
At that point in execution, it checks the @value{FN} to see whether
|
|
it is really a variable assignment; if so, @command{awk} sets the variable
|
|
instead of reading a file.
|
|
|
|
Therefore, the variables actually receive the given values after all
|
|
previously specified files have been read. In particular, the values of
|
|
variables assigned in this fashion are @emph{not} available inside a
|
|
@code{BEGIN} rule
|
|
(@pxref{BEGIN/END}),
|
|
because such rules are run before @command{awk} begins scanning the argument list.
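
A short sketch makes the timing visible. The variable name @code{n} is
arbitrary, and the final @samp{-} argument names the standard input
explicitly as the data source:

@example
$ echo data | awk 'BEGIN @{ print "BEGIN: [" n "]" @}
>                        @{ print "main: [" n "]" @}' n=5 -
@print{} BEGIN: []
@print{} main: [5]
@end example

@noindent
The variable is still unset when the @code{BEGIN} rule runs, but it has
its value by the time the first record is read.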
|
|
|
|
@cindex dark corner, escape sequences
|
|
The variable values given on the command line are processed for escape
|
|
sequences (@pxref{Escape Sequences}).
|
|
@value{DARKCORNER}
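
For instance, in the following sketch (the variable name @code{msg} is
arbitrary), the @samp{\n} in the assigned value is converted to a
newline, so the single assignment produces two lines of output:

@example
$ echo | awk '@{ print msg @}' 'msg=one\ntwo' -
@print{} one
@print{} two
@end example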
|
|
|
|
In some earlier implementations of @command{awk}, when a variable assignment
|
|
occurred before any @value{FN}s, the assignment would happen @emph{before}
|
|
the @code{BEGIN} rule was executed. @command{awk}'s behavior was thus
|
|
inconsistent; some command-line assignments were available inside the
|
|
@code{BEGIN} rule, while others were not. Unfortunately,
|
|
some applications came to depend
|
|
upon this ``feature.'' When @command{awk} was changed to be more consistent,
|
|
the @option{-v} option was added to accommodate applications that depended
|
|
upon the old behavior.
|
|
|
|
The variable assignment feature is most useful for assigning to variables
|
|
such as @code{RS}, @code{OFS}, and @code{ORS}, which control input and
|
|
output formats before scanning the @value{DF}s. It is also useful for
|
|
controlling state if multiple passes are needed over a @value{DF}. For
|
|
example:
|
|
|
|
@cindex files, multiple passes over
|
|
@example
awk 'pass == 1  @{ @var{pass 1 stuff} @}
     pass == 2  @{ @var{pass 2 stuff} @}' pass=1 mydata pass=2 mydata
@end example
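
To make the template above concrete, here is a sketch that uses the
first pass to total the first field and the second pass to print each
value with its share of the total. The @value{DF} @file{mydata} and its
contents are assumptions made for the illustration:

@example
$ cat mydata
@print{} 1
@print{} 3
$ awk 'pass == 1 @{ total += $1 @}
>      pass == 2 @{ printf "%d %d%%\n", $1, 100 * $1 / total @}' pass=1 mydata pass=2 mydata
@print{} 1 25%
@print{} 3 75%
@end example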
|
|
|
|
Given the variable assignment feature, the @option{-F} option for setting
|
|
the value of @code{FS} is not
|
|
strictly necessary. It remains for historical compatibility.
|
|
|
|
@node AWKPATH Variable
|
|
@section The @env{AWKPATH} Environment Variable
|
|
@cindex @env{AWKPATH} environment variable
|
|
@cindex directories, searching
|
|
@cindex search paths, for source files
|
|
@cindex differences in @command{awk} and @command{gawk}, @env{AWKPATH} environment variable
|
|
@ifinfo
|
|
The previous @value{SECTION} described how @command{awk} program files can be named
|
|
on the command line with the @option{-f} option.
|
|
@end ifinfo
|
|
In most @command{awk}
|
|
implementations, you must supply a precise path name for each program
|
|
file, unless the file is in the current directory.
|
|
But in @command{gawk}, if the @value{FN} supplied to the @option{-f} option
|
|
does not contain a @samp{/}, then @command{gawk} searches a list of
|
|
directories (called the @dfn{search path}), one by one, looking for a
|
|
file with the specified name.
|
|
|
|
The search path is a string consisting of directory names
|
|
separated by colons. @command{gawk} gets its search path from the
|
|
@env{AWKPATH} environment variable. If that variable does not exist,
|
|
@command{gawk} uses a default path,
|
|
@samp{.:/usr/local/share/awk}.@footnote{Your version of @command{gawk}
|
|
may use a different directory; it
|
|
will depend upon how @command{gawk} was built and installed. The actual
|
|
directory is the value of @samp{$(datadir)} generated when
|
|
@command{gawk} was configured. You probably don't need to worry about this,
|
|
though.} (Programs written for use by
|
|
system administrators should use an @env{AWKPATH} variable that
|
|
does not include the current directory, @file{.}.)
|
|
|
|
The search path feature is particularly useful for building libraries
|
|
of useful @command{awk} functions. The library files can be placed in a
|
|
standard directory in the default path and then specified on
|
|
the command line with a short @value{FN}. Otherwise, the full @value{FN}
|
|
would have to be typed for each file.
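
As a sketch, suppose that you keep your library files in a hypothetical
directory @file{$HOME/awklib} and that it contains @file{lib.awk}.
With a Bourne-compatible shell, you might then set the search path and
name the library by its short @value{FN}. The main program is given as
@file{./main.awk}; because that name contains a @samp{/}, no path
search is done for it:

@example
AWKPATH=$HOME/awklib:/usr/local/share/awk
export AWKPATH
gawk -f lib.awk -f ./main.awk datafile
@end example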
|
|
|
|
By using both the @option{--source} and @option{-f} options, your command-line
|
|
@command{awk} programs can use facilities in @command{awk} library files
|
|
(@pxref{Library Functions}).
|
|
Path searching is not done if @command{gawk} is in compatibility mode.
|
|
This is true for both @option{--traditional} and @option{--posix}.
|
|
@xref{Options}.
|
|
|
|
@strong{Note:} If you want files in the current directory to be found,
|
|
you must include the current directory in the path, either by including
|
|
@file{.} explicitly in the path or by writing a null entry in the
|
|
path. (A null entry is indicated by starting or ending the path with a
|
|
colon or by placing two colons next to each other (@samp{::}).) If the
|
|
current directory is not included in the path, then files cannot be
|
|
found in the current directory. This path search mechanism is identical
|
|
to the shell's.
|
|
@c someday, @cite{The Bourne Again Shell}....
|
|
|
|
Starting with @value{PVERSION} 3.0, if @env{AWKPATH} is not defined in the
|
|
environment, @command{gawk} places its default search path into
|
|
@code{ENVIRON["AWKPATH"]}. This makes it easy to determine
|
|
the actual search path that @command{gawk} will use
|
|
from within an @command{awk} program.
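
For example, with @env{AWKPATH} removed from the environment, the
following sketch prints the default path built into your copy of
@command{gawk}. No output is shown because the value depends on how
@command{gawk} was configured and installed:

@example
$ unset AWKPATH
$ gawk 'BEGIN @{ print ENVIRON["AWKPATH"] @}'
@end example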
|
|
|
|
While you can change @code{ENVIRON["AWKPATH"]} within your @command{awk}
|
|
program, this has no effect on the running program's behavior. This makes
|
|
sense: the @env{AWKPATH} environment variable is used to find the program
|
|
source files. Once your program is running, all the files have been
|
|
found, and @command{gawk} no longer needs to use @env{AWKPATH}.
|
|
|
|
@node Obsolete
|
|
@section Obsolete Options and/or Features
|
|
|
|
@cindex features, advanced, See advanced features
|
|
@cindex options, deprecated
|
|
@cindex features, deprecated
|
|
@cindex obsolete features
|
|
This @value{SECTION} describes features and/or command-line options from
|
|
previous releases of @command{gawk} that are either not available in the
|
|
current version or that are still supported but deprecated (meaning that
|
|
they will @emph{not} be in the next release).
|
|
|
|
@c update this section for each release!
|
|
|
|
@cindex @code{next file} statement, deprecated
|
|
@cindex @code{nextfile} statement, @code{next file} statement and
|
|
For @value{PVERSION} @value{VERSION} of @command{gawk}, there are no
|
|
deprecated command-line options
|
|
@c or other deprecated features
|
|
from the previous version of @command{gawk}.
|
|
The use of @samp{next file} (two words) for @code{nextfile} was deprecated
|
|
in @command{gawk} 3.0 but still worked. Starting with @value{PVERSION} 3.1, the
|
|
two-word usage is no longer accepted.
|
|
|
|
The process-related special files described in
|
|
@ref{Special Process},
|
|
work as described, but
|
|
are now considered deprecated.
|
|
@command{gawk} prints a warning message every time they are used.
|
|
(Use @code{PROCINFO} instead; see
|
|
@ref{Auto-set}.)
|
|
They will be removed from the next release of @command{gawk}.
|
|
|
|
@ignore
|
|
This @value{SECTION}
|
|
is thus essentially a place holder,
|
|
in case some option becomes obsolete in a future version of @command{gawk}.
|
|
@end ignore
|
|
|
|
@node Undocumented
|
|
@section Undocumented Options and Features
|
|
@cindex undocumented features
|
|
@cindex features, undocumented
|
|
@cindex Skywalker, Luke
|
|
@cindex Kenobi, Obi-Wan
|
|
@cindex Jedi knights
|
|
@cindex Knights, jedi
|
|
@quotation
|
|
@i{Use the Source, Luke!}@*
|
|
Obi-Wan
|
|
@end quotation
|
|
|
|
This @value{SECTION} intentionally left
|
|
blank.
|
|
|
|
@ignore
|
|
@c If these came out in the Info file or TeX document, then they wouldn't
|
|
@c be undocumented, would they?
|
|
|
|
@command{gawk} has one undocumented option:
|
|
|
|
@table @code
|
|
@item -W nostalgia
|
|
@itemx --nostalgia
|
|
Print the message @code{"awk: bailing out near line 1"} and dump core.
|
|
This option was inspired by the common behavior of very early versions of
|
|
Unix @command{awk} and by a t--shirt.
|
|
The message is @emph{not} subject to translation in non-English locales.
|
|
@c so there! nyah, nyah.
|
|
@end table
|
|
|
|
Early versions of @command{awk} used to not require any separator (either
|
|
a newline or @samp{;}) between the rules in @command{awk} programs. Thus,
|
|
it was common to see one-line programs like:
|
|
|
|
@example
|
|
awk '@{ sum += $1 @} END @{ print sum @}'
|
|
@end example
|
|
|
|
@command{gawk} actually supports this but it is purposely undocumented
|
|
because it is considered bad style. The correct way to write such a program
|
|
is either
|
|
|
|
@example
|
|
awk '@{ sum += $1 @} ; END @{ print sum @}'
|
|
@end example
|
|
|
|
@noindent
|
|
or
|
|
|
|
@example
|
|
awk '@{ sum += $1 @}
|
|
END @{ print sum @}' data
|
|
@end example
|
|
|
|
@noindent
|
|
@xref{Statements/Lines}, for a fuller
|
|
explanation.
|
|
|
|
You can insert newlines after the @samp{;} in @code{for} loops.
|
|
This seems to have been a long-undocumented feature in Unix @command{awk}.
|
|
|
|
Similarly, you may use @code{print} or @code{printf} statements in the
|
|
@var{init} and @var{increment} parts of a @code{for} loop. This is another
|
|
long-undocumented ``feature'' of Unix @command{awk}.
|
|
|
|
If the environment variable @env{WHINY_USERS} exists
|
|
when @command{gawk} is run,
|
|
then the associative @code{for} loop will go through the array
|
|
indices in sorted order.
|
|
The comparison used for sorting is simple string comparison;
|
|
any non-English or non-ASCII locales are not taken into account.
|
|
@code{IGNORECASE} does not affect the comparison either.
|
|
|
|
In addition, if @env{WHINY_USERS} is set, the profiled version of a
|
|
program generated by @option{--profile} will print all 8-bit characters
|
|
verbatim, instead of using the octal equivalent.
|
|
|
|
@end ignore
|
|
|
|
@node Known Bugs
|
|
@section Known Bugs in @command{gawk}
|
|
@cindex @command{gawk}, debugging
|
|
@cindex debugging @command{gawk}
|
|
@cindex troubleshooting, @command{gawk}
|
|
|
|
@itemize @bullet
|
|
@cindex troubleshooting, @code{-F} option
|
|
@cindex @code{-F} option, troubleshooting
|
|
@cindex @code{FS} variable, changing value of
|
|
@item
|
|
The @option{-F} option for changing the value of @code{FS}
|
|
(@pxref{Options})
|
|
is not necessary given the command-line variable
|
|
assignment feature; it remains only for backward compatibility.
|
|
|
|
@item
|
|
Syntactically invalid single-character programs tend to overflow
|
|
the parse stack, generating a rather unhelpful message. Such programs
|
|
are surprisingly difficult to diagnose in the completely general case,
|
|
and the effort to do so really is not worth it.
|
|
@end itemize
|
|
|
|
@ignore
|
|
@c Try this
|
|
@iftex
|
|
@page
|
|
@headings off
|
|
@majorheading II@ @ @ Using @command{awk} and @command{gawk}
|
|
Part II shows how to use @command{awk} and @command{gawk} for problem solving.
|
|
There is lots of code here for you to read and learn from.
|
|
It contains the following chapters:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
@ref{Library Functions}.
|
|
|
|
@item
|
|
@ref{Sample Programs}.
|
|
|
|
@end itemize
|
|
|
|
@page
|
|
@evenheading @thispage@ @ @ @strong{@value{TITLE}} @| @|
|
|
@oddheading @| @| @strong{@thischapter}@ @ @ @thispage
|
|
@end iftex
|
|
@end ignore
|
|
|
|
@node Library Functions
|
|
@chapter A Library of @command{awk} Functions
|
|
@c STARTOFRANGE libf
|
|
@cindex libraries of @command{awk} functions
|
|
@c STARTOFRANGE flib
|
|
@cindex functions, library
|
|
@c STARTOFRANGE fudlib
|
|
@cindex functions, user-defined, library of
|
|
|
|
@ref{User-defined}, describes how to write
|
|
your own @command{awk} functions. Writing functions is important, because
|
|
it allows you to encapsulate algorithms and program tasks in a single
|
|
place. It simplifies programming, making program development more
|
|
manageable, and making programs more readable.
|
|
|
|
One valuable way to learn a new programming language is to @emph{read}
|
|
programs in that language. To that end, this @value{CHAPTER}
|
|
and @ref{Sample Programs},
|
|
provide a good-sized body of code for you to read,
|
|
and hopefully, to learn from.
|
|
|
|
@c 2e: USE TEXINFO-2 FUNCTION DEFINITION STUFF!!!!!!!!!!!!!
|
|
This @value{CHAPTER} presents a library of useful @command{awk} functions.
|
|
Many of the sample programs presented later in this @value{DOCUMENT}
|
|
use these functions.
|
|
The functions are presented here in a progression from simple to complex.
|
|
|
|
@cindex Texinfo
|
|
@ref{Extract Program},
|
|
presents a program that you can use to extract the source code for
|
|
these example library functions and programs from the Texinfo source
|
|
for this @value{DOCUMENT}.
|
|
(This has already been done as part of the @command{gawk} distribution.)
|
|
|
|
If you have written one or more useful, general-purpose @command{awk} functions
|
|
and would like to contribute them to the author's collection of @command{awk}
|
|
programs, see
|
|
@ref{How To Contribute}, for more information.
|
|
|
|
@cindex portability, example programs
|
|
The programs in this @value{CHAPTER} and in
|
|
@ref{Sample Programs},
|
|
freely use features that are @command{gawk}-specific.
|
|
Rewriting these programs for different implementations of @command{awk} is pretty straightforward.
|
|
|
|
Diagnostic error messages are sent to @file{/dev/stderr}.
|
|
Use @samp{| "cat 1>&2"} instead of @samp{> "/dev/stderr"} if your system
|
|
does not have a @file{/dev/stderr}, or if you cannot use @command{gawk}.
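
As a rough sketch of that substitution, the hypothetical helper @code{warn}
below (the name is an assumption, not one of the library functions) routes
a message through @samp{cat 1>&2}, so that it reaches standard error
without relying on @file{/dev/stderr}:

@example
# warn --- hypothetical helper; not part of the library
function warn(msg)
@{
    print msg | "cat 1>&2"    # the shell sends cat's output to stderr
@}

BEGIN @{
    warn("something went wrong")
    close("cat 1>&2")         # flush the pipe before continuing
@}
@end example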
|
|
|
|
A number of programs use @code{nextfile}
|
|
(@pxref{Nextfile Statement})
|
|
to skip any remaining input in the input file.
|
|
@ref{Nextfile Function},
|
|
shows you how to write a function that does the same thing.
|
|
|
|
@c 12/2000: Thanks to Nelson Beebe for pointing out the output issue.
|
|
@cindex case sensitivity, example programs
|
|
@cindex @code{IGNORECASE} variable, in example programs
|
|
Finally, some of the programs choose to ignore upper- and lowercase
|
|
distinctions in their input. They do so by assigning one to @code{IGNORECASE}.
|
|
You can achieve almost the same effect@footnote{The effects are
|
|
not identical. Output of the transformed
|
|
record will be in all lowercase, while @code{IGNORECASE} preserves the original
|
|
contents of the input record.} by adding the following rule to the
|
|
beginning of the program:
|
|
|
|
@example
|
|
# ignore case
|
|
@{ $0 = tolower($0) @}
|
|
@end example
|
|
|
|
@noindent
|
|
Also, verify that all regexp and string constants used in
|
|
comparisons use only lowercase letters.
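
For instance, a minimal sketch of a program written this way might count
the lines that mention @samp{error} in any mix of upper- and lowercase.
The pattern and the variable name @code{count} are illustrative only:

@example
# ignore case
@{ $0 = tolower($0) @}

/error/ @{ count++ @}    # the regexp constant is all lowercase

END @{ print count + 0 @}
@end example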
|
|
|
|
@menu
|
|
* Library Names:: How to best name private global variables in
|
|
library functions.
|
|
* General Functions:: Functions that are of general use.
|
|
* Data File Management:: Functions for managing command-line data
|
|
files.
|
|
* Getopt Function:: A function for processing command-line
|
|
arguments.
|
|
* Passwd Functions:: Functions for getting user information.
|
|
* Group Functions:: Functions for getting group information.
|
|
@end menu
|
|
|
|
@node Library Names
|
|
@section Naming Library Function Global Variables
|
|
|
|
@cindex names, arrays/variables
|
|
@cindex names, functions
|
|
@cindex namespace issues
|
|
@cindex @command{awk} programs, documenting
|
|
@cindex documentation, of @command{awk} programs
|
|
Due to the way the @command{awk} language evolved, variables are either
|
|
@dfn{global} (usable by the entire program) or @dfn{local} (usable just by
|
|
a specific function). There is no intermediate state analogous to
|
|
@code{static} variables in C.
|
|
|
|
@cindex variables, global, for library functions
|
|
@cindex private variables
|
|
@cindex variables, private
|
|
Library functions often need to have global variables that they can use to
|
|
preserve state information between calls to the function---for example,
|
|
@code{getopt}'s variable @code{_opti}
|
|
(@pxref{Getopt Function}).
|
|
Such variables are called @dfn{private}, since the only functions that need to
|
|
use them are the ones in the library.
|
|
|
|
When writing a library function, you should try to choose names for your
|
|
private variables that will not conflict with any variables used by
|
|
either another library function or a user's main program. For example, a
|
|
name like @samp{i} or @samp{j} is not a good choice, because user programs
|
|
often use variable names like these for their own purposes.
|
|
|
|
@cindex programming conventions, private variable names
|
|
The example programs shown in this @value{CHAPTER} all start the names of their
|
|
private variables with an underscore (@samp{_}). Users generally don't use
|
|
leading underscores in their variable names, so this convention immediately
|
|
decreases the chances that the variable name will be accidentally shared
|
|
with the user's program.
|
|
|
|
@cindex @code{_} (underscore), in names of private variables
|
|
@cindex underscore (@code{_}), in names of private variables
|
|
In addition, several of the library functions use a prefix that helps
|
|
indicate what function or set of functions use the variables---for example,
|
|
@code{_pw_byname} in the user database routines
|
|
(@pxref{Passwd Functions}).
|
|
This convention is recommended, since it even further decreases the
|
|
chance of inadvertent conflict among variable names. Note that this
|
|
convention is used equally well for variable names and for private
|
|
function names.@footnote{While all the library routines could have
|
|
been rewritten to use this convention, this was not done, in order to
|
|
show how my own @command{awk} programming style has evolved and to
|
|
provide some basis for this discussion.}
|
|
|
|
As a final note on variable naming, if a function makes global variables
|
|
available for use by a main program, it is a good convention to start that
|
|
variable's name with a capital letter---for
|
|
example, @code{getopt}'s @code{Opterr} and @code{Optind} variables
|
|
(@pxref{Getopt Function}).
|
|
The leading capital letter indicates that it is global, while the fact that
|
|
the variable name is not all capital letters indicates that the variable is
|
|
not one of @command{awk}'s built-in variables, such as @code{FS}.
|
|
|
|
@cindex @code{--dump-variables} option
|
|
It is also important that @emph{all} variables in library
|
|
functions that do not need to save state are, in fact, declared
|
|
local.@footnote{@command{gawk}'s @option{--dump-variables} command-line
|
|
option is useful for verifying this.} If this is not done, the variable
|
|
could accidentally be used in the user's program, leading to bugs that
|
|
are very difficult to track down:
|
|
|
|
@example
|
|
function lib_func(x, y,    l1, l2)
@{
    @dots{}
    @var{use variable} some_var   # some_var should be local
    @dots{}                     # but is not by oversight
@}
|
|
@end example
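
To make the danger concrete, here is a small sketch in which a
hypothetical library function, @code{sum_to}, forgets to declare its loop
counter as local. Both the function and the main program are invented
for illustration:

@example
# sum_to --- hypothetical; i should be in the parameter list, but is not
function sum_to(n,    total)
@{
    for (i = 1; i <= n; i++)
        total += i
    return total
@}

BEGIN @{
    for (i = 1; i <= 3; i++) @{
        s = sum_to(i)    # clobbers the caller's i
        print s
    @}
@}
@end example

@noindent
Because both loops share the global @code{i}, the main loop skips its
second iteration and the program prints only 1 and 6. Adding @code{i}
after @code{total} in the parameter list restores the expected 1, 3, and 6.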
|
|
|
|
@cindex arrays, associative, library functions and
|
|
@cindex libraries of @command{awk} functions, associative arrays and
|
|
@cindex functions, library, associative arrays and
|
|
@cindex Tcl
|
|
A different convention, common in the Tcl community, is to use a single
|
|
associative array to hold the values needed by the library function(s), or
|
|
``package.'' This significantly decreases the number of actual global names
|
|
in use. For example, the functions described in
|
|
@ref{Passwd Functions},
|
|
might have used array elements @code{@w{PW_data["inited"]}}, @code{@w{PW_data["total"]}},
|
|
@code{@w{PW_data["count"]}}, and @code{@w{PW_data["awklib"]}}, instead of
|
|
@code{@w{_pw_inited}}, @code{@w{_pw_awklib}}, @code{@w{_pw_total}},
|
|
and @code{@w{_pw_count}}.
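
The following sketch shows the single-array convention on a small scale.
The ``package'' here is hypothetical: neither the function @code{cnt_next}
nor the array @code{CNT_data} appears in the library, and the element
names are assumptions chosen for the example:

@example
# cnt_next --- hypothetical package; all state lives in CNT_data
function cnt_next(name)
@{
    if (! ("inited" in CNT_data)) @{
        CNT_data["inited"] = 1
        CNT_data["total"] = 0
    @}
    CNT_data["total"]++
    return ++CNT_data["count", name]
@}

BEGIN @{
    cnt_next("a"); cnt_next("b"); cnt_next("a")
    print CNT_data["count", "a"], CNT_data["total"]    # 2 3
@}
@end example

@noindent
Only one global name, @code{CNT_data}, is visible to the user's program,
no matter how many values the package needs to remember.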
|
|
|
|
The conventions presented in this @value{SECTION} are exactly
|
|
that: conventions. You are not required to write your programs this
|
|
way---we merely recommend that you do so.
|
|
|
|
@node General Functions
|
|
@section General Programming
|
|
|
|
This @value{SECTION} presents a number of functions that are of general
|
|
programming use.
|
|
|
|
@menu
|
|
* Nextfile Function:: Two implementations of a @code{nextfile}
|
|
function.
|
|
* Assert Function:: A function for assertions in @command{awk}
|
|
programs.
|
|
* Round Function:: A function for rounding if @code{sprintf} does
|
|
not do it correctly.
|
|
* Cliff Random Function:: The Cliff Random Number Generator.
|
|
* Ordinal Functions:: Functions for using characters as numbers and
|
|
vice versa.
|
|
* Join Function:: A function to join an array into a string.
|
|
* Gettimeofday Function:: A function to get formatted times.
|
|
@end menu
|
|
|
|
@node Nextfile Function
|
|
@subsection Implementing @code{nextfile} as a Function
|
|
|
|
@cindex input files, skipping
|
|
@c STARTOFRANGE libfnex
|
|
@cindex libraries of @command{awk} functions, @code{nextfile} statement
|
|
@c STARTOFRANGE flibnex
|
|
@cindex functions, library, @code{nextfile} statement
|
|
@c STARTOFRANGE nexim
|
|
@cindex @code{nextfile} statement, implementing
|
|
@cindex @command{gawk}, @code{nextfile} statement in
|
|
The @code{nextfile} statement, presented in
|
|
@ref{Nextfile Statement},
|
|
is a @command{gawk}-specific extension---it is not available in most other
|
|
implementations of @command{awk}. This @value{SECTION} shows two versions of a
|
|
@code{nextfile} function that you can use to simulate @command{gawk}'s
|
|
@code{nextfile} statement if you cannot use @command{gawk}.
|
|
|
|
A first attempt at writing a @code{nextfile} function is as follows:
|
|
|
|
@example
|
|
# nextfile --- skip remaining records in current file
|
|
# this should be read in before the "main" awk program
|
|
|
|
function nextfile() @{ _abandon_ = FILENAME; next @}
|
|
_abandon_ == FILENAME @{ next @}
|
|
@end example
|
|
|
|
@cindex programming conventions, @code{nextfile} statement
|
|
Because it supplies a rule that must be executed first, this file should
|
|
be included before the main program. This rule compares the current
|
|
@value{DF}'s name (which is always in the @code{FILENAME} variable) to
|
|
a private variable named @code{_abandon_}. If the @value{FN} matches,
|
|
then the action part of the rule executes a @code{next} statement to
|
|
go on to the next record. (The use of @samp{_} in the variable name is
|
|
a convention. It is discussed more fully in
|
|
@ref{Library Names}.)
|
|
|
|
The use of the @code{next} statement effectively creates a loop that reads
|
|
all the records from the current @value{DF}.
|
|
The end of the file is eventually reached and
|
|
a new @value{DF} is opened, changing the value of @code{FILENAME}.
|
|
Once this happens, the comparison of @code{_abandon_} to @code{FILENAME}
|
|
fails, and execution continues with the first rule of the ``real'' program.
|
|
|
|
The @code{nextfile} function itself simply sets the value of @code{_abandon_}
|
|
and then executes a @code{next} statement to start the
|
|
loop.
|
|
@ignore
|
|
@c If the function can't be used on other versions of awk, this whole
|
|
@c section is pointless, no? Sigh.
|
|
@footnote{@command{gawk} is the only known @command{awk} implementation
|
|
that allows you to
|
|
execute @code{next} from within a function body. Some other workaround
|
|
is necessary if you are not using @command{gawk}.}
|
|
@end ignore
|
|
|
|
@cindex @code{nextfile} user-defined function
|
|
This initial version has a subtle problem.
|
|
If the same @value{DF} is listed @emph{twice} on the command line,
|
|
one right after the other
|
|
or even with just a variable assignment between them,
|
|
this code skips right through the file a second time, even though
|
|
it should stop when it gets to the end of the first occurrence.
|
|
A second version of @code{nextfile} that remedies this problem
|
|
is shown here:
|
|
|
|
@example
|
|
@c file eg/lib/nextfile.awk
|
|
# nextfile --- skip remaining records in current file
|
|
# correctly handle successive occurrences of the same file
|
|
@c endfile
|
|
@ignore
|
|
@c file eg/lib/nextfile.awk
|
|
#
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# May, 1993
|
|
|
|
@c endfile
|
|
@end ignore
|
|
@c file eg/lib/nextfile.awk
|
|
# this should be read in before the "main" awk program
|
|
|
|
function nextfile() @{ _abandon_ = FILENAME; next @}
|
|
|
|
_abandon_ == FILENAME @{
    if (FNR == 1)
        _abandon_ = ""
    else
        next
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
The @code{nextfile} function has not changed. It makes @code{_abandon_}
|
|
equal to the current @value{FN} and then executes a @code{next} statement.
|
|
The @code{next} statement reads the next record and increments @code{FNR}
|
|
so that @code{FNR} is guaranteed to have a value of at least two.
|
|
However, if @code{nextfile} is called for the last record in the file,
|
|
then @command{awk} closes the current @value{DF} and moves on to the next
|
|
one. Upon doing so, @code{FILENAME} is set to the name of the new file
|
|
and @code{FNR} is reset to one. If this next file is the same as
|
|
the previous one, @code{_abandon_} is still equal to @code{FILENAME}.
|
|
However, @code{FNR} is equal to one, telling us that this is a new
|
|
occurrence of the file and not the one we were reading when the
|
|
@code{nextfile} function was executed. In that case, @code{_abandon_}
|
|
is reset to the empty string, so that further executions of this rule
|
|
fail (until the next time that @code{nextfile} is called).
|
|
|
|
If @code{FNR} is not one, then we are still in the original @value{DF}
|
|
and the program executes a @code{next} statement to skip through it.
|
|
|
|
An important question to ask at this point is: given that the
|
|
functionality of @code{nextfile} can be provided with a library file,
|
|
why is it built into @command{gawk}? Adding
|
|
features for little reason leads to larger, slower programs that are
|
|
harder to maintain.
|
|
The answer is that building @code{nextfile} into @command{gawk} provides
|
|
significant gains in efficiency. If the @code{nextfile} function is executed
|
|
at the beginning of a large @value{DF}, @command{awk} still has to scan the entire
|
|
file, splitting it up into records,
|
|
@c at least conceptually
|
|
just to skip over it. The built-in
|
|
@code{nextfile} can simply close the file immediately and proceed to the
|
|
next one, which saves a lot of time. This is particularly important in
|
|
@command{awk}, because @command{awk} programs are generally I/O-bound (i.e.,
|
|
they spend most of their time doing input and output, instead of performing
|
|
computations).
|
|
@c ENDOFRANGE libfnex
|
|
@c ENDOFRANGE flibnex
|
|
@c ENDOFRANGE nexim
|
|
|
|
@node Assert Function
|
|
@subsection Assertions
|
|
|
|
@c STARTOFRANGE asse
|
|
@cindex assertions
|
|
@c STARTOFRANGE assef
|
|
@cindex @code{assert} function (C library)
|
|
@c STARTOFRANGE libfass
|
|
@cindex libraries of @command{awk} functions, assertions
|
|
@c STARTOFRANGE flibass
|
|
@cindex functions, library, assertions
|
|
@cindex @command{awk} programs, lengthy, assertions
|
|
When writing large programs, it is often useful to know
|
|
that a condition or set of conditions is true. Before proceeding with a
|
|
particular computation, you make a statement about what you believe to be
|
|
the case. Such a statement is known as an
|
|
@dfn{assertion}. The C language provides an @code{<assert.h>} header file
|
|
and corresponding @code{assert} macro that the programmer can use to make
|
|
assertions. If an assertion fails, the @code{assert} macro arranges to
|
|
print a diagnostic message describing the condition that should have
|
|
been true but was not, and then it kills the program. In C, using
|
|
@code{assert} looks like this:
|
|
|
|
@example
|
|
#include <assert.h>
|
|
|
|
int myfunc(int a, double b)
@{
     assert(a <= 5 && b >= 17.1);
     @dots{}
@}
|
|
@end example
|
|
|
|
If the assertion fails, the program prints a message similar to this:
|
|
|
|
@example
|
|
prog.c:5: assertion failed: a <= 5 && b >= 17.1
|
|
@end example
|
|
|
|
@cindex @code{assert} user-defined function
|
|
The C language makes it possible to turn the condition into a string for use
|
|
in printing the diagnostic message. This is not possible in @command{awk}, so
|
|
this @code{assert} function also requires a string version of the condition
|
|
that is being tested.
|
|
Following is the function:
|
|
|
|
@example
|
|
@c file eg/lib/assert.awk
|
|
# assert --- assert that a condition is true. Otherwise exit.
|
|
@c endfile
|
|
@ignore
|
|
@c file eg/lib/assert.awk
|
|
|
|
#
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# May, 1993
|
|
|
|
@c endfile
|
|
@end ignore
|
|
@c file eg/lib/assert.awk
|
|
function assert(condition, string)
@{
    if (! condition) @{
        printf("%s:%d: assertion failed: %s\n",
            FILENAME, FNR, string) > "/dev/stderr"
        _assert_exit = 1
        exit 1
    @}
@}
|
|
|
|
@group
|
|
END @{
    if (_assert_exit)
        exit 1
@}
|
|
@end group
|
|
@c endfile
|
|
@end example
|
|
|
|
The @code{assert} function tests the @code{condition} parameter. If it
|
|
is false, it prints a message to standard error, using the @code{string}
|
|
parameter to describe the failed condition. It then sets the variable
|
|
@code{_assert_exit} to one and executes the @code{exit} statement.
|
|
The @code{exit} statement jumps to the @code{END} rule. If the @code{END}
|
|
rule finds @code{_assert_exit} to be true, it exits immediately.
|
|
|
|
The purpose of the test in the @code{END} rule is to
|
|
keep any other @code{END} rules from running. When an assertion fails, the
|
|
program should exit immediately.
|
|
If no assertions fail, then @code{_assert_exit} is still
|
|
false when the @code{END} rule is run normally, and the rest of the
|
|
program's @code{END} rules execute.
|
|
For all of this to work correctly, @file{assert.awk} must be the
|
|
first source file read by @command{awk}.
|
|
The function can be used in a program in the following way:
|
|
|
|
@example
|
|
function myfunc(a, b)
@{
     assert(a <= 5 && b >= 17.1, "a <= 5 && b >= 17.1")
     @dots{}
@}
|
|
@end example
|
|
|
|
@noindent
|
|
If the assertion fails, you see a message similar to the following:
|
|
|
|
@example
|
|
mydata:1357: assertion failed: a <= 5 && b >= 17.1
|
|
@end example
|
|
|
|
@cindex @code{END} pattern, @code{assert} user-defined function and
|
|
There is a small problem with this version of @code{assert}.
|
|
An @code{END} rule is automatically added
|
|
to the program calling @code{assert}. Normally, if a program consists
|
|
of just a @code{BEGIN} rule, the input files and/or standard input are
|
|
not read. However, now that the program has an @code{END} rule, @command{awk}
|
|
attempts to read the input @value{DF}s or standard input
|
|
(@pxref{Using BEGIN/END}),
|
|
most likely causing the program to hang as it waits for input.
|
|
|
|
@cindex @code{BEGIN} pattern, @code{assert} user-defined function and
|
|
There is a simple workaround to this:
|
|
make sure the @code{BEGIN} rule always ends
|
|
with an @code{exit} statement.
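
A sketch of the workaround, assuming @file{assert.awk} has been loaded
first; the variable @code{n} and the message text are illustrative:

@example
BEGIN @{
    n = 3
    assert(n > 0, "n > 0")
    print "n is", n
    exit    # skip reading input; END rules still run
@}
@end example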
|
|
@c ENDOFRANGE asse
|
|
@c ENDOFRANGE assef
|
|
@c ENDOFRANGE flibass
|
|
@c ENDOFRANGE libfass
|
|
|
|
@node Round Function
|
|
@subsection Rounding Numbers
|
|
|
|
@cindex rounding
|
|
@cindex rounding numbers
|
|
@cindex numbers, rounding
|
|
@cindex libraries of @command{awk} functions, rounding numbers
|
|
@cindex functions, library, rounding numbers
|
|
@cindex @code{print} statement, @code{sprintf} function and
|
|
@cindex @code{printf} statement, @code{sprintf} function and
|
|
@cindex @code{sprintf} function, @code{print}/@code{printf} statements and
|
|
The way @code{printf} and @code{sprintf}
|
|
(@pxref{Printf})
|
|
perform rounding often depends upon the system's C @code{sprintf}
|
|
subroutine. On many machines, @code{sprintf} rounding is ``unbiased,''
|
|
which means it doesn't always round a trailing @samp{.5} up, contrary
|
|
to naive expectations. In unbiased rounding, @samp{.5} rounds to even,
|
|
rather than always up, so 1.5 rounds to 2 but 4.5 rounds to 4. This means
|
|
that if you are using a format that does rounding (e.g., @code{"%.0f"}),
|
|
you should check what your system does. The following function does
|
|
traditional rounding; it might be useful if your @command{awk}'s @code{printf}
|
|
does unbiased rounding:
|
|
|
|
@cindex @code{round} user-defined function
|
|
@example
|
|
@c file eg/lib/round.awk
|
|
# round.awk --- do normal rounding
|
|
@c endfile
|
|
@ignore
|
|
@c file eg/lib/round.awk
|
|
#
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# August, 1996
|
|
|
|
@c endfile
|
|
@end ignore
|
|
@c file eg/lib/round.awk
|
|
function round(x,   ival, aval, fraction)
@{
   ival = int(x)    # integer part, int() truncates

   # see if fractional part
   if (ival == x)   # no fraction
      return x

   if (x < 0) @{
      aval = -x     # absolute value
      ival = int(aval)
      fraction = aval - ival
      if (fraction >= .5)
         return int(x) - 1   # -2.5 --> -3
      else
         return int(x)       # -2.3 --> -2
   @} else @{
      fraction = x - ival
      if (fraction >= .5)
         return ival + 1
      else
         return ival
   @}
@}
|
|
|
|
# test harness
|
|
@{ print $0, round($0) @}
|
|
@c endfile
|
|
@end example
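
As a quick check of the behavior described above, the following sketch
compares @code{printf} with @code{round}. It assumes that the @code{round}
function has already been loaded; the final @code{exit} keeps the
test-harness rule from waiting for input. What @code{printf} prints
depends on the system, so no output is shown for it:

@example
BEGIN @{
    printf("%.0f %.0f\n", 2.5, 3.5)    # system-dependent
    print round(2.5), round(3.5)       # 3 4
    print round(-2.5), round(-2.3)     # -3 -2
    exit
@}
@end example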
|
|
|
|
@node Cliff Random Function
|
|
@subsection The Cliff Random Number Generator
|
|
@cindex random numbers, Cliff
|
|
@cindex Cliff random numbers
|
|
@cindex numbers, Cliff random
|
|
@cindex functions, library, Cliff random numbers
|
|
|
|
The Cliff random number
|
|
generator@footnote{@uref{http://mathworld.wolfram.com/CliffRandomNumberGenerator.html}}
|
|
is a very simple random number generator that ``passes the noise sphere test
|
|
for randomness by showing no structure.''
|
|
It is easily programmed, in fewer than 10 lines of @command{awk} code:
|
|
|
|
@cindex @code{cliff_rand} user-defined function
|
|
@example
|
|
@c file eg/lib/cliff_rand.awk
|
|
# cliff_rand.awk --- generate Cliff random numbers
|
|
@c endfile
|
|
@ignore
|
|
@c file eg/lib/cliff_rand.awk
|
|
#
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# December 2000
|
|
|
|
@c endfile
|
|
@end ignore
|
|
@c file eg/lib/cliff_rand.awk
|
|
BEGIN @{ _cliff_seed = 0.1 @}
|
|
|
|
function cliff_rand()
@{
    _cliff_seed = (100 * log(_cliff_seed)) % 1
    if (_cliff_seed < 0)
        _cliff_seed = - _cliff_seed
    return _cliff_seed
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
This algorithm requires an initial ``seed'' of 0.1. Each new value
|
|
uses the current seed as input for the calculation.
|
|
If the built-in @code{rand} function
|
|
(@pxref{Numeric Functions})
|
|
isn't random enough, you might try using this function instead.
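
One possible use, sketched here under the assumption that
@file{cliff_rand.awk} is loaded first so that its @code{BEGIN} rule has
already set the seed, is to scale the result into a range of integers.
The range of one to six is simply an example:

@example
BEGIN @{
    # five simulated rolls of a six-sided die
    for (i = 1; i <= 5; i++)
        print int(cliff_rand() * 6) + 1
@}
@end example

@noindent
Because the seed always starts at 0.1, every run of such a program
produces the same sequence of values.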
|
|
|
|
@node Ordinal Functions
|
|
@subsection Translating Between Characters and Numbers
|
|
|
|
@cindex libraries of @command{awk} functions, character values as numbers
|
|
@cindex functions, library, character values as numbers
|
|
@cindex characters, values of as numbers
|
|
@cindex numbers, as values of characters
|
|
One commercial implementation of @command{awk} supplies a built-in function,
|
|
@code{ord}, which takes a character and returns the numeric value for that
|
|
character in the machine's character set. If the string passed to
|
|
@code{ord} has more than one character, only the first one is used.
|
|
|
|
The inverse of this function is @code{chr} (from the function of the same
|
|
name in Pascal), which takes a number and returns the corresponding character.
|
|
Both functions are written very nicely in @command{awk}; there is no real
|
|
reason to build them into the @command{awk} interpreter:
|
|
|
|
@cindex @code{ord} user-defined function
|
|
@cindex @code{chr} user-defined function
|
|
@example
|
|
@c file eg/lib/ord.awk
|
|
# ord.awk --- do ord and chr
|
|
|
|
# Global identifiers:
|
|
# _ord_: numerical values indexed by characters
|
|
# _ord_init: function to initialize _ord_
|
|
@c endfile
|
|
@ignore
|
|
@c file eg/lib/ord.awk
|
|
#
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# 16 January, 1992
|
|
# 20 July, 1992, revised
|
|
|
|
@c endfile
|
|
@end ignore
|
|
@c file eg/lib/ord.awk
|
|
BEGIN @{ _ord_init() @}
|
|
|
|
function _ord_init(    low, high, i, t)
@{
    low = sprintf("%c", 7) # BEL is ascii 7
    if (low == "\a") @{    # regular ascii
        low = 0
        high = 127
    @} else if (sprintf("%c", 128 + 7) == "\a") @{
        # ascii, mark parity
        low = 128
        high = 255
    @} else @{        # ebcdic(!)
        low = 0
        high = 255
    @}

    for (i = low; i <= high; i++) @{
        t = sprintf("%c", i)
        _ord_[t] = i
    @}
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
@cindex character sets
|
|
@cindex character encodings
|
|
@cindex ASCII
|
|
@cindex EBCDIC
|
|
@cindex mark parity
|
|
Some explanation of the numbers used by @code{_ord_init} is worthwhile.
|
|
The most prominent character set in use today is ASCII. Although an
|
|
8-bit byte can hold 256 distinct values (from 0 to 255), ASCII only
|
|
defines characters that use the values from 0 to 127.@footnote{ASCII
|
|
has been extended in many countries to use the values from 128 to 255
|
|
for country-specific characters. If your system uses these extensions,
|
|
you can simplify @code{_ord_init} to simply loop from 0 to 255.}
|
|
In the now distant past,
|
|
at least one minicomputer manufacturer
|
|
@c Pr1me, blech
|
|
used ASCII, but with mark parity, meaning that the leftmost bit in the byte
|
|
is always 1. This means that on those systems, characters
|
|
have numeric values from 128 to 255.
|
|
Finally, large mainframe systems use the EBCDIC character set, which
|
|
uses all 256 values.
|
|
While there are other character sets in use on some older systems,
|
|
they are not really worth worrying about:
|
|
|
|
@example
|
|
@c file eg/lib/ord.awk
|
|
function ord(str,    c)
@{
    # only first character is of interest
    c = substr(str, 1, 1)
    return _ord_[c]
@}

function chr(c)
@{
    # force c to be numeric by adding 0
    return sprintf("%c", c + 0)
@}
|
|
@c endfile
|
|
|
|
#### test code ####
|
|
# BEGIN \
|
|
# @{
|
|
# for (;;) @{
|
|
# printf("enter a character: ")
|
|
# if (getline var <= 0)
|
|
# break
|
|
# printf("ord(%s) = %d\n", var, ord(var))
|
|
# @}
|
|
# @}
|
|
@c endfile
|
|
@end example
|
|
|
|
An obvious improvement to these functions is to move the code for the
|
|
@code{@w{_ord_init}} function into the body of the @code{BEGIN} rule. It was
|
|
written this way initially for ease of development.
|
|
There is a ``test program'' in a @code{BEGIN} rule, to test the
|
|
function. It is commented out for production use.
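
A short usage sketch, assuming that @file{ord.awk} is loaded first (so that
its @code{BEGIN} rule has filled in @code{_ord_}) and that the system uses
plain ASCII:

@example
BEGIN @{
    print ord("A")             # 65 on an ASCII system
    print ord("Apple")         # also 65; only the first character counts
    print chr(ord("A") + 1)    # B on an ASCII system
@}
@end example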
|
|
|
|
@node Join Function
|
|
@subsection Merging an Array into a String
|
|
|
|
@cindex libraries of @command{awk} functions, merging arrays into strings
|
|
@cindex functions, library, merging arrays into strings
|
|
@cindex strings, merging arrays into
|
|
@cindex arrays, merging into strings
|
|
When doing string processing, it is often useful to be able to join
|
|
all the strings in an array into one long string. The following function,
|
|
@code{join}, accomplishes this task. It is used later in several of
|
|
the application programs
|
|
(@pxref{Sample Programs}).
|
|
|
|
Good function design is important; this function needs to be general but it
|
|
should also have a reasonable default behavior. It is called with an array
|
|
as well as the beginning and ending indices of the elements in the array to be
|
|
merged. This assumes that the array indices are numeric---a reasonable
|
|
assumption since the array was likely created with @code{split}
|
|
(@pxref{String Functions}):
|
|
|
|
@cindex @code{join} user-defined function
|
|
@example
|
|
@c file eg/lib/join.awk
|
|
# join.awk --- join an array into a string
|
|
@c endfile
|
|
@ignore
|
|
@c file eg/lib/join.awk
|
|
#
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# May 1993
|
|
|
|
@c endfile
|
|
@end ignore
|
|
@c file eg/lib/join.awk
|
|
function join(array, start, end, sep,    result, i)
@{
    if (sep == "")
       sep = " "
    else if (sep == SUBSEP) # magic value
       sep = ""
    result = array[start]
    for (i = start + 1; i <= end; i++)
        result = result sep array[i]
    return result
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
An optional additional argument is the separator to use when joining the
|
|
strings back together. If the caller supplies a nonempty value,
|
|
@code{join} uses it; if it is not supplied, it has a null
|
|
value. In this case, @code{join} uses a single blank as a default
|
|
separator for the strings. If the value is equal to @code{SUBSEP},
|
|
then @code{join} joins the strings with no separator between them.
|
|
@code{SUBSEP} serves as a ``magic'' value to indicate that there should
|
|
be no separation between the component strings.@footnote{It would
|
|
be nice if @command{awk} had an assignment operator for concatenation.
|
|
The lack of an explicit operator for concatenation makes string operations
|
|
more difficult than they really need to be.}
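
The three separator cases can be sketched as follows, assuming the
@code{join} function above has been loaded. The array name @code{parts}
and the sample string are illustrative:

@example
BEGIN @{
    n = split("a b c", parts, " ")
    print join(parts, 1, n)            # a b c
    print join(parts, 1, n, "-")       # a-b-c
    print join(parts, 2, n, SUBSEP)    # bc
@}
@end example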
|
|
|
|
@node Gettimeofday Function
|
|
@subsection Managing the Time of Day
|
|
|
|
@cindex libraries of @command{awk} functions, managing, time
|
|
@cindex functions, library, managing time
|
|
@cindex timestamps, formatted
|
|
@cindex time, managing
|
|
The @code{systime} and @code{strftime} functions described in
|
|
@ref{Time Functions},
|
|
provide the minimum functionality necessary for dealing with the time of day
|
|
in human readable form. While @code{strftime} is extensive, the control
|
|
formats are not necessarily easy to remember or intuitively obvious when
|
|
reading a program.
|
|
|
|
The following function, @code{gettimeofday}, populates a user-supplied array
|
|
with preformatted time information. It returns a string with the current
|
|
time formatted in the same way as the @command{date} utility:
|
|
|
|
@cindex @code{gettimeofday} user-defined function
|
|
@example
|
|
@c file eg/lib/gettime.awk
|
|
# gettimeofday.awk --- get the time of day in a usable format
|
|
@c endfile
|
|
@ignore
|
|
@c file eg/lib/gettime.awk
|
|
#
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain, May 1993
|
|
#
|
|
@c endfile
|
|
@end ignore
|
|
@c file eg/lib/gettime.awk
|
|
|
|
# Returns a string in the format of output of date(1)
|
|
# Populates the array argument time with individual values:
|
|
# time["second"] -- seconds (0 - 59)
|
|
# time["minute"] -- minutes (0 - 59)
|
|
# time["hour"] -- hours (0 - 23)
|
|
# time["althour"] -- hours (1 - 12)
|
|
# time["monthday"] -- day of month (1 - 31)
|
|
# time["month"] -- month of year (1 - 12)
|
|
# time["monthname"] -- name of the month
|
|
# time["shortmonth"] -- short name of the month
|
|
# time["year"] -- year modulo 100 (0 - 99)
|
|
# time["fullyear"] -- full year
|
|
# time["weekday"] -- day of week (Sunday = 0)
|
|
# time["altweekday"] -- day of week (Monday = 1)
|
|
# time["dayname"] -- name of weekday
|
|
# time["shortdayname"] -- short name of weekday
|
|
# time["yearday"] -- day of year (1 - 366)
|
|
# time["timezone"] -- abbreviation of timezone name
|
|
# time["ampm"] -- AM or PM designation
|
|
# time["weeknum"] -- week number, Sunday first day
|
|
# time["altweeknum"] -- week number, Monday first day
|
|
|
|
function gettimeofday(time,    ret, now, i)
@{
    # get time once, avoids unnecessary system calls
    now = systime()

    # return date(1)-style output
    ret = strftime("%a %b %d %H:%M:%S %Z %Y", now)

    # clear out target array
    delete time

    # fill in values, force numeric values to be
    # numeric by adding 0
    time["second"]       = strftime("%S", now) + 0
    time["minute"]       = strftime("%M", now) + 0
    time["hour"]         = strftime("%H", now) + 0
    time["althour"]      = strftime("%I", now) + 0
    time["monthday"]     = strftime("%d", now) + 0
    time["month"]        = strftime("%m", now) + 0
    time["monthname"]    = strftime("%B", now)
    time["shortmonth"]   = strftime("%b", now)
    time["year"]         = strftime("%y", now) + 0
    time["fullyear"]     = strftime("%Y", now) + 0
    time["weekday"]      = strftime("%w", now) + 0
    time["altweekday"]   = strftime("%u", now) + 0
    time["dayname"]      = strftime("%A", now)
    time["shortdayname"] = strftime("%a", now)
    time["yearday"]      = strftime("%j", now) + 0
    time["timezone"]     = strftime("%Z", now)
    time["ampm"]         = strftime("%p", now)
    time["weeknum"]      = strftime("%U", now) + 0
    time["altweeknum"]   = strftime("%W", now) + 0

    return ret
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
The string indices are easier to use and read than the various formats
|
|
required by @code{strftime}. The @code{alarm} program presented in
|
|
@ref{Alarm Program},
|
|
uses this function.
|
|
A more general design for the @code{gettimeofday} function would have
|
|
allowed the user to supply an optional timestamp value to use instead
|
|
of the current time.
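
One way such a more general design might look is sketched here. The
function name @code{gettimeofday_at}, its @code{when} parameter, and the
shell wrapper @code{stamp_date} are hypothetical names chosen for this
illustration; the sketch fills in only three array elements and assumes an
@command{awk} that provides @code{systime} and @code{strftime}
(such as @command{gawk}):

```shell
#!/bin/sh
# Sketch only: gettimeofday_at() and its "when" parameter are hypothetical
# names, not part of the library file shown in this section.
stamp_date() {
    # $1 is an optional timestamp in seconds since the epoch.
    # TZ=UTC keeps the result independent of the local time zone.
    TZ=UTC awk -v when="$1" '
    function gettimeofday_at(time, when,    now) {
        # use the supplied timestamp, or fall back to the current time
        now = (when == "" ? systime() : when + 0)
        split("", time)     # portable way to clear the array
        time["fullyear"] = strftime("%Y", now) + 0
        time["month"]    = strftime("%m", now) + 0
        time["monthday"] = strftime("%d", now) + 0
        return strftime("%a %b %d %H:%M:%S %Z %Y", now)
    }
    BEGIN {
        gettimeofday_at(t, when)
        printf "%04d-%02d-%02d\n", t["fullyear"], t["month"], t["monthday"]
    }'
}

stamp_date 86400     # one day after the epoch
```

Calling @code{stamp_date} with no argument falls back to the current time,
which matches the behavior of the original function.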
|
|
|
|
@node Data File Management
|
|
@section @value{DDF} Management
|
|
|
|
@c STARTOFRANGE dataf
|
|
@cindex files, managing
|
|
@c STARTOFRANGE libfdataf
|
|
@cindex libraries of @command{awk} functions, managing, @value{DF}s
|
|
@c STARTOFRANGE flibdataf
|
|
@cindex functions, library, managing @value{DF}s
|
|
This @value{SECTION} presents functions that are useful for managing
|
|
command-line @value{DF}s.
|
|
|
|
@menu
|
|
* Filetrans Function:: A function for handling data file transitions.
|
|
* Rewind Function:: A function for rereading the current file.
|
|
* File Checking:: Checking that data files are readable.
|
|
* Empty Files:: Checking for zero-length files.
|
|
* Ignoring Assigns:: Treating assignments as file names.
|
|
@end menu
|
|
|
|
@node Filetrans Function
|
|
@subsection Noting @value{DDF} Boundaries
|
|
|
|
@cindex files, managing, @value{DF} boundaries
|
|
@cindex files, initialization and cleanup
|
|
The @code{BEGIN} and @code{END} rules are each executed exactly once at
|
|
the beginning and end of your @command{awk} program, respectively
|
|
(@pxref{BEGIN/END}).
|
|
We (the @command{gawk} authors) once had a user who mistakenly thought that the
|
|
@code{BEGIN} rule is executed at the beginning of each @value{DF} and the
|
|
@code{END} rule is executed at the end of each @value{DF}. When informed
|
|
that this was not the case, the user requested that we add new special
|
|
patterns to @command{gawk}, named @code{BEGIN_FILE} and @code{END_FILE}, that
|
|
would have the desired behavior. He even supplied us the code to do so.
|
|
|
|
Adding these special patterns to @command{gawk} wasn't necessary;
|
|
the job can be done cleanly in @command{awk} itself, as illustrated
|
|
by the following library program.
|
|
It arranges to call two user-supplied functions, @code{beginfile} and
|
|
@code{endfile}, at the beginning and end of each @value{DF}.
|
|
Besides solving the problem in only nine(!) lines of code, it does so
|
|
@emph{portably}; this works with any implementation of @command{awk}:
|
|
|
|
@example
|
|
# transfile.awk
|
|
#
|
|
# Give the user a hook for filename transitions
|
|
#
|
|
# The user must supply functions beginfile() and endfile()
|
|
# that each take the name of the file being started or
|
|
# finished, respectively.
|
|
@c #
|
|
@c # Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
@c # January 1992
|
|
|
|
FILENAME != _oldfilename \
@{
    if (_oldfilename != "")
        endfile(_oldfilename)
    _oldfilename = FILENAME
    beginfile(FILENAME)
@}

END   @{ endfile(FILENAME) @}
|
|
@end example
|
|
|
|
This file must be loaded before the user's ``main'' program, so that the
|
|
rule it supplies is executed first.
|
|
|
|
This rule relies on @command{awk}'s @code{FILENAME} variable that
|
|
automatically changes for each new @value{DF}. The current @value{FN} is
|
|
saved in a private variable, @code{_oldfilename}. If @code{FILENAME} does
|
|
not equal @code{_oldfilename}, then a new @value{DF} is being processed and
|
|
it is necessary to call @code{endfile} for the old file. Because
|
|
@code{endfile} should only be called if a file has been processed, the
|
|
program first checks to make sure that @code{_oldfilename} is not the null
|
|
string. The program then assigns the current @value{FN} to
|
|
@code{_oldfilename} and calls @code{beginfile} for the file.
|
|
Because, like all @command{awk} variables, @code{_oldfilename} is
|
|
initialized to the null string, this rule executes correctly even for the
|
|
first @value{DF}.
|
|
|
|
The program also supplies an @code{END} rule to do the final processing for
|
|
the last file. Because this @code{END} rule comes before any @code{END} rules
|
|
supplied in the ``main'' program, @code{endfile} is called first. Once
|
|
again the value of multiple @code{BEGIN} and @code{END} rules should be clear.
|
|
|
|
@cindex @code{beginfile} user-defined function
|
|
@cindex @code{endfile} user-defined function
|
|
This version has the same problem as the first version of @code{nextfile}
|
|
(@pxref{Nextfile Function}).
|
|
If the same @value{DF} occurs twice in a row on the command line, then
|
|
@code{endfile} and @code{beginfile} are not executed at the end of the
|
|
first pass and at the beginning of the second pass.
|
|
The following version solves the problem:
|
|
|
|
@example
|
|
@c file eg/lib/ftrans.awk
|
|
# ftrans.awk --- handle data file transitions
|
|
#
|
|
# user supplies beginfile() and endfile() functions
|
|
@c endfile
|
|
@ignore
|
|
@c file eg/lib/ftrans.awk
|
|
#
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# November 1992
|
|
|
|
@c endfile
|
|
@end ignore
|
|
@c file eg/lib/ftrans.awk
|
|
FNR == 1 @{
    if (_filename_ != "")
        endfile(_filename_)
    _filename_ = FILENAME
    beginfile(FILENAME)
@}

END  @{ endfile(_filename_) @}
|
|
@c endfile
|
|
@end example
|
|
|
|
@ref{Wc Program},
|
|
shows how this library function can be used and
|
|
how it simplifies writing the main program.
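
As a smaller illustration, the following shell sketch pairs the
@samp{FNR == 1} transition rule with made-up @code{beginfile} and
@code{endfile} functions that report a per-file record count. The wrapper
name @code{demo_ftrans}, the sample files, and the function bodies are
assumptions made for this example only:

```shell
#!/bin/sh
# Sketch: per-file record counts driven by the FNR == 1 transition rule.
# demo_ftrans, the sample files, and the hook bodies are illustrative only.
demo_ftrans() {
    tmp=$(mktemp -d) || return 1
    printf 'a\nb\n'    > "$tmp/one"
    printf 'c\nd\ne\n' > "$tmp/two"
    (cd "$tmp" && awk '
    # user-supplied hooks
    function beginfile(name) { n = 0 }
    function endfile(name)   { print name ": " n }

    # transition rule, as in ftrans.awk
    FNR == 1 {
        if (_filename_ != "")
            endfile(_filename_)
        _filename_ = FILENAME
        beginfile(FILENAME)
    }

    { n++ }     # the "main" program: count records

    END { endfile(_filename_) }
    ' one one two)
    rm -rf "$tmp"
}

demo_ftrans
```

Because the rule tests @code{FNR} rather than comparing @value{FN}s, naming
@file{one} twice in a row still produces two separate reports for it.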
|
|
|
|
@node Rewind Function
|
|
@subsection Rereading the Current File
|
|
|
|
@cindex files, reading
|
|
Another request for a new built-in function was for a @code{rewind}
|
|
function that would make it possible to reread the current file.
|
|
The requesting user didn't want to have to use @code{getline}
|
|
(@pxref{Getline})
|
|
inside a loop.
|
|
|
|
However, as long as you are not in the @code{END} rule, it is
|
|
quite easy to arrange to immediately close the current input file
|
|
and then start over with it from the top.
|
|
For lack of a better name, we'll call it @code{rewind}:
|
|
|
|
@cindex @code{rewind} user-defined function
|
|
@example
|
|
@c file eg/lib/rewind.awk
|
|
# rewind.awk --- rewind the current file and start over
|
|
@c endfile
|
|
@ignore
|
|
@c file eg/lib/rewind.awk
|
|
#
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# September 2000
|
|
|
|
@c endfile
|
|
@end ignore
|
|
@c file eg/lib/rewind.awk
|
|
function rewind(    i)
@{
    # shift remaining arguments up
    for (i = ARGC; i > ARGIND; i--)
        ARGV[i] = ARGV[i-1]

    # make sure gawk knows to keep going
    ARGC++

    # make current file next to get done
    ARGV[ARGIND+1] = FILENAME

    # do it
    nextfile
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
This code relies on the @code{ARGIND} variable
|
|
(@pxref{Auto-set}),
|
|
which is specific to @command{gawk}.
|
|
If you are not using
|
|
@command{gawk}, you can use ideas presented in
|
|
@ifnotinfo
|
|
the previous @value{SECTION}
|
|
@end ifnotinfo
|
|
@ifinfo
|
|
@ref{Filetrans Function},
|
|
@end ifinfo
|
|
to either update @code{ARGIND} on your own
|
|
or modify this code as appropriate.
|
|
|
|
The @code{rewind} function also relies on the @code{nextfile} keyword
|
|
(@pxref{Nextfile Statement}).
|
|
@xref{Nextfile Function},
|
|
for a function version of @code{nextfile}.
|
|
|
|
@node File Checking
|
|
@subsection Checking for Readable @value{DDF}s
|
|
|
|
@cindex troubleshooting, readable @value{DF}s
|
|
@c comma is part of primary
|
|
@cindex readable @value{DF}s, checking
|
|
@cindex files, skipping
|
|
Normally, if you give @command{awk} a @value{DF} that isn't readable,
|
|
it stops with a fatal error. There are times when you
|
|
might want to just ignore such files and keep going. You can
|
|
do this by prepending the following program to your @command{awk}
|
|
program:
|
|
|
|
@cindex @code{readable.awk} program
|
|
@example
|
|
@c file eg/lib/readable.awk
|
|
# readable.awk --- library file to skip over unreadable files
|
|
@c endfile
|
|
@ignore
|
|
@c file eg/lib/readable.awk
|
|
#
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# October 2000
|
|
|
|
@c endfile
|
|
@end ignore
|
|
@c file eg/lib/readable.awk
|
|
BEGIN @{
    for (i = 1; i < ARGC; i++) @{
        if (ARGV[i] ~ /^[A-Za-z_][A-Za-z0-9_]*=.*/ \
            || ARGV[i] == "-")
            continue    # assignment or standard input
        else if ((getline junk < ARGV[i]) < 0) # unreadable
            delete ARGV[i]
        else
            close(ARGV[i])
    @}
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
@cindex troubleshooting, @code{getline} function
|
|
In @command{gawk}, the @code{getline} won't be fatal (unless
|
|
@option{--posix} is in force).
|
|
Removing the element from @code{ARGV} with @code{delete}
|
|
skips the file (since it's no longer in the list).
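
The effect can be seen with a short shell sketch. The wrapper name
@code{demo_readable} and the @value{FN}s @file{missing} and @file{good} are
made up for the illustration; the @code{BEGIN} rule is the one from
@file{readable.awk}:

```shell
#!/bin/sh
# Sketch: a nonexistent file is dropped from ARGV before input starts.
# demo_readable and the file names are illustrative only.
demo_readable() {
    tmp=$(mktemp -d) || return 1
    printf 'hello\nworld\n' > "$tmp/good"
    (cd "$tmp" && awk '
    # skip unreadable files, as in readable.awk
    BEGIN {
        for (i = 1; i < ARGC; i++) {
            if (ARGV[i] ~ /^[A-Za-z_][A-Za-z0-9_]*=.*/ || ARGV[i] == "-")
                continue            # assignment or standard input
            else if ((getline junk < ARGV[i]) < 0)
                delete ARGV[i]      # unreadable: remove it from the list
            else
                close(ARGV[i])      # readable: let awk reopen it later
        }
    }

    { print FILENAME ": " $0 }
    ' missing good)
    rm -rf "$tmp"
}

demo_readable
```

Only the records of @file{good} are printed; @file{missing} is skipped
without a fatal error.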
|
|
|
|
@c This doesn't handle /dev/stdin etc. Not worth the hassle to mention or fix.
|
|
|
|
@node Empty Files
|
|
@subsection Checking For Zero-length Files
|
|
|
|
All known @command{awk} implementations silently skip over zero-length files.
|
|
This is a by-product of @command{awk}'s implicit
|
|
read-a-record-and-match-against-the-rules loop: when @command{awk}
|
|
tries to read a record from an empty file, it immediately receives an
|
|
end of file indication, closes the file, and proceeds on to the next
|
|
command-line @value{DF}, @emph{without} executing any user-level
|
|
@command{awk} program code.
|
|
|
|
Using @command{gawk}'s @code{ARGIND} variable
|
|
(@pxref{Built-in Variables}), it is possible to detect when an empty
|
|
@value{DF} has been skipped. Similar to the library file presented
|
|
in @ref{Filetrans Function}, the following library file calls a function named
|
|
@code{zerofile} that the user must provide. The arguments passed are
|
|
the @value{FN} and the position in @code{ARGV} where it was found:
|
|
|
|
@cindex @code{zerofile.awk} program
|
|
@example
|
|
@c file eg/lib/zerofile.awk
|
|
# zerofile.awk --- library file to process empty input files
|
|
@c endfile
|
|
@ignore
|
|
@c file eg/lib/zerofile.awk
|
|
#
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# June 2003
|
|
|
|
@c endfile
|
|
@end ignore
|
|
@c file eg/lib/zerofile.awk
|
|
BEGIN @{ Argind = 0 @}

ARGIND > Argind + 1 @{
    for (Argind++; Argind < ARGIND; Argind++)
        zerofile(ARGV[Argind], Argind)
@}

ARGIND != Argind @{ Argind = ARGIND @}

END @{
    if (ARGIND > Argind)
        for (Argind++; Argind <= ARGIND; Argind++)
            zerofile(ARGV[Argind], Argind)
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
The user-level variable @code{Argind} allows the @command{awk} program
|
|
to track its progress through @code{ARGV}. Whenever the program detects
|
|
that @code{ARGIND} is greater than @samp{Argind + 1}, it means that one or
|
|
more empty files were skipped. The action then calls @code{zerofile} for
|
|
each such file, incrementing @code{Argind} along the way.
|
|
|
|
The @samp{ARGIND != Argind} rule simply keeps @code{Argind} up to date
|
|
in the normal case.
|
|
|
|
Finally, the @code{END} rule catches the case of any empty files at
|
|
the end of the command-line arguments. Note that the test in the
|
|
condition of the @code{for} loop uses the @samp{<=} operator,
|
|
not @code{<}.
|
|
|
|
As an exercise, you might consider whether this same problem can
|
|
be solved without relying on @command{gawk}'s @code{ARGIND} variable.
|
|
|
|
As a second exercise, revise this code to handle the case where
|
|
an intervening value in @code{ARGV} is a variable assignment.
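
As a possible starting point for these exercises, here is a sketch of a
different technique from the one used in @file{zerofile.awk}: it probes each
command-line argument with @code{getline} in a @code{BEGIN} rule, so it does
not need @code{ARGIND} and it skips arguments that look like variable
assignments. The trade-off is that empty files are reported before any input
is processed, not at the point where they are skipped. The wrapper name
@code{demo_zerofile}, the sample files, and the body of @code{zerofile} are
assumptions made for the illustration:

```shell
#!/bin/sh
# Sketch: detect empty files portably by probing ARGV in BEGIN.
# This is NOT the ARGIND-based approach of zerofile.awk.
demo_zerofile() {
    tmp=$(mktemp -d) || return 1
    printf 'a\nb\n' > "$tmp/data.txt"
    : > "$tmp/empty.txt"
    (cd "$tmp" && awk '
    # user-supplied hook (illustrative body)
    function zerofile(name, pos) {
        printf "empty: %s (ARGV[%d])\n", name, pos
    }

    BEGIN {
        for (i = 1; i < ARGC; i++) {
            # skip assignments, standard input, and deleted arguments
            if (ARGV[i] ~ /^[A-Za-z_][A-Za-z0-9_]*=/ \
                || ARGV[i] == "-" || ARGV[i] == "")
                continue
            # getline returns 0 at end of file, so 0 on the first
            # read means the file has no records at all
            if ((getline line < ARGV[i]) == 0)
                zerofile(ARGV[i], i)
            close(ARGV[i])
        }
    }

    { n++ }
    END { print "records: " n }
    ' data.txt empty.txt x=1 data.txt)
    rm -rf "$tmp"
}

demo_zerofile
```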
|
|
|
|
@ignore
|
|
# zerofile2.awk --- same thing, portably
|
|
BEGIN @{
|
|
ARGIND = Argind = 0
|
|
for (i = 1; i < ARGC; i++)
|
|
Fnames[ARGV[i]]++
|
|
|
|
@}
|
|
FNR == 1 @{
|
|
while (ARGV[ARGIND] != FILENAME)
|
|
ARGIND++
|
|
Seen[FILENAME]++
|
|
if (Seen[FILENAME] == Fnames[FILENAME])
|
|
do
|
|
ARGIND++
|
|
while (ARGV[ARGIND] != FILENAME)
|
|
@}
|
|
ARGIND > Argind + 1 @{
|
|
for (Argind++; Argind < ARGIND; Argind++)
|
|
zerofile(ARGV[Argind], Argind)
|
|
@}
|
|
ARGIND != Argind @{
|
|
Argind = ARGIND
|
|
@}
|
|
END @{
|
|
if (ARGIND < ARGC - 1)
|
|
ARGIND = ARGC - 1
|
|
if (ARGIND > Argind)
|
|
for (Argind++; Argind <= ARGIND; Argind++)
|
|
zerofile(ARGV[Argind], Argind)
|
|
@}
|
|
@end ignore
|
|
|
|
@node Ignoring Assigns
|
|
@subsection Treating Assignments as @value{FFN}s
|
|
|
|
@cindex assignments as filenames
|
|
@cindex filenames, assignments as
|
|
Occasionally, you might not want @command{awk} to process command-line
|
|
variable assignments
|
|
(@pxref{Assignment Options}).
|
|
In particular, if you have @value{FN}s that contain an @samp{=} character,
|
|
@command{awk} treats the @value{FN} as an assignment, and does not process it.
|
|
|
|
Some users have suggested an additional command-line option for @command{gawk}
|
|
to disable command-line assignments. However, some simple programming with
|
|
a library file does the trick:
|
|
|
|
@cindex @code{noassign.awk} program
|
|
@example
|
|
@c file eg/lib/noassign.awk
|
|
# noassign.awk --- library file to avoid the need for a
|
|
# special option that disables command-line assignments
|
|
@c endfile
|
|
@ignore
|
|
@c file eg/lib/noassign.awk
|
|
#
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# October 1999
|
|
|
|
@c endfile
|
|
@end ignore
|
|
@c file eg/lib/noassign.awk
|
|
function disable_assigns(argc, argv,    i)
@{
    for (i = 1; i < argc; i++)
        if (argv[i] ~ /^[A-Za-z_][A-Za-z_0-9]*=.*/)
            argv[i] = ("./" argv[i])
@}

BEGIN @{
    if (No_command_assign)
        disable_assigns(ARGC, ARGV)
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
You then run your program this way:
|
|
|
|
@example
|
|
awk -v No_command_assign=1 -f noassign.awk -f yourprog.awk *
|
|
@end example
|
|
|
|
The function works by looping through the arguments.
|
|
It prepends @samp{./} to
|
|
any argument that matches the form
|
|
of a variable assignment, turning that argument into a @value{FN}.
|
|
|
|
The use of @code{No_command_assign} allows you to disable command-line
|
|
assignments at invocation time, by giving the variable a true value.
|
|
When not set, it is initially zero (i.e., false), so the command-line arguments
|
|
are left alone.
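
The difference between the two settings can be seen in the following shell
sketch. The wrapper name @code{demo_noassign} and the sample file
@file{x=1} are made up for the illustration; standard input is redirected
from @file{/dev/null} so that @command{awk} has nothing to read when the
argument is taken as an assignment:

```shell
#!/bin/sh
# Sketch: a file literally named "x=1" is read only when
# No_command_assign is true. demo_noassign is an illustrative name.
demo_noassign() {
    tmp=$(mktemp -d) || return 1
    printf 'hello\n' > "$tmp/x=1"
    (cd "$tmp" && awk -v No_command_assign="$1" '
    function disable_assigns(argc, argv,    i) {
        for (i = 1; i < argc; i++)
            if (argv[i] ~ /^[A-Za-z_][A-Za-z_0-9]*=.*/)
                argv[i] = ("./" argv[i])
    }

    BEGIN {
        if (No_command_assign)
            disable_assigns(ARGC, ARGV)
    }

    { print FILENAME ": " $0 }
    ' x=1 < /dev/null)
    rm -rf "$tmp"
}

demo_noassign 1     # the argument becomes ./x=1 and is read as a file
demo_noassign 0     # the argument is an assignment; nothing is printed
```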
|
|
@c ENDOFRANGE dataf
|
|
@c ENDOFRANGE flibdataf
|
|
@c ENDOFRANGE libfdataf
|
|
|
|
@node Getopt Function
|
|
@section Processing Command-Line Options
|
|
|
|
@c STARTOFRANGE libfclo
|
|
@cindex libraries of @command{awk} functions, command-line options
|
|
@c STARTOFRANGE flibclo
|
|
@cindex functions, library, command-line options
|
|
@c STARTOFRANGE clop
|
|
@cindex command-line options, processing
|
|
@c STARTOFRANGE oclp
|
|
@cindex options, command-line, processing
|
|
@c STARTOFRANGE clibf
|
|
@cindex functions, library, C library
|
|
@cindex arguments, processing
|
|
Most utilities on POSIX-compatible systems take options, or ``switches,'' on
|
|
the command line that can be used to change the way a program behaves.
|
|
@command{awk} is an example of such a program
|
|
(@pxref{Options}).
|
|
Often, options take @dfn{arguments}; i.e., data that the program needs to
|
|
correctly obey the command-line option. For example, @command{awk}'s
|
|
@option{-F} option requires a string to use as the field separator.
|
|
The first occurrence on the command line of either @option{--} or a
|
|
string that does not begin with @samp{-} ends the options.
|
|
|
|
@cindex @code{getopt} function (C library)
|
|
Modern Unix systems provide a C function named @code{getopt} for processing
|
|
command-line arguments. The programmer provides a string describing the
|
|
one-letter options. If an option requires an argument, it is followed in the
|
|
string with a colon. @code{getopt} is also passed the
|
|
count and values of the command-line arguments and is called in a loop.
|
|
@code{getopt} processes the command-line arguments for option letters.
|
|
Each time around the loop, it returns a single character representing the
|
|
next option letter that it finds, or @samp{?} if it finds an invalid option.
|
|
When it returns @minus{}1, there are no options left on the command line.
|
|
|
|
When using @code{getopt}, options that do not take arguments can be
|
|
grouped together. Furthermore, options that take arguments require that the
|
|
argument be present. The argument can immediately follow the option letter,
|
|
or it can be a separate command-line argument.
|
|
|
|
Given a hypothetical program that takes
|
|
three command-line options, @option{-a}, @option{-b}, and @option{-c}, where
|
|
@option{-b} requires an argument, all of the following are valid ways of
|
|
invoking the program:
|
|
|
|
@example
|
|
prog -a -b foo -c data1 data2 data3
|
|
prog -ac -bfoo -- data1 data2 data3
|
|
prog -acbfoo data1 data2 data3
|
|
@end example
|
|
|
|
Notice that when the argument is grouped with its option, the rest of
|
|
the argument is considered to be the option's argument.
|
|
In this example, @option{-acbfoo} indicates that all of the
|
|
@option{-a}, @option{-b}, and @option{-c} options were supplied,
|
|
and that @samp{foo} is the argument to the @option{-b} option.
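
These conventions can be tried out without writing any C, because the
shell's @code{getopts} built-in command follows the same rules. The
following sketch is only an illustration (the function name @code{parse}
and its output format are made up); it runs the three invocations shown
above through @code{getopts}:

```shell
#!/bin/sh
# Sketch: the three example invocations, parsed with the shell getopts
# built-in. parse and its output format are illustrative only.
parse() {
    OPTIND=1                    # restart option parsing for each call
    seen=""
    while getopts "ab:c" opt "$@"; do
        case $opt in
            a|c) seen="${seen:+$seen }$opt" ;;
            b)   seen="${seen:+$seen }b=$OPTARG" ;;
            *)   return 2 ;;    # invalid option
        esac
    done
    shift $((OPTIND - 1))       # drop the options, keep the operands
    printf '%s | %s\n' "$seen" "$*"
}

parse -a -b foo -c data1 data2 data3
parse -ac -bfoo -- data1 data2 data3
parse -acbfoo data1 data2 data3
```

In all three cases @samp{foo} ends up as the argument of @option{-b}, and
the three data arguments are left over as nonoption arguments.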
|
|
|
|
@code{getopt} provides four external variables that the programmer can use:
|
|
|
|
@table @code
|
|
@item optind
|
|
The index in the argument value array (@code{argv}) where the first
|
|
nonoption command-line argument can be found.
|
|
|
|
@item optarg
|
|
The string value of the argument to an option.
|
|
|
|
@item opterr
|
|
Usually @code{getopt} prints an error message when it finds an invalid
|
|
option. Setting @code{opterr} to zero disables this feature. (An
|
|
application might want to print its own error message.)
|
|
|
|
@item optopt
|
|
The letter representing the command-line option.
|
|
@c While not usually documented, most versions supply this variable.
|
|
@end table
|
|
|
|
The following C fragment shows how @code{getopt} might process command-line
|
|
arguments for @command{awk}:
|
|
|
|
@example
|
|
int
main(int argc, char *argv[])
@{
    @dots{}
    /* print our own message */
    opterr = 0;
    while ((c = getopt(argc, argv, "v:f:F:W:")) != -1) @{
        switch (c) @{
        case 'f':    /* file */
            @dots{}
            break;
        case 'F':    /* field separator */
            @dots{}
            break;
        case 'v':    /* variable assignment */
            @dots{}
            break;
        case 'W':    /* extension */
            @dots{}
            break;
        case '?':
        default:
            usage();
            break;
        @}
    @}
    @dots{}
@}
|
|
@end example
|
|
|
|
As a side point, @command{gawk} actually uses the GNU @code{getopt_long}
|
|
function to process both normal and GNU-style long options
|
|
(@pxref{Options}).
|
|
|
|
The abstraction provided by @code{getopt} is very useful and is quite
|
|
handy in @command{awk} programs as well. Following is an @command{awk}
|
|
version of @code{getopt}. This function highlights one of the
|
|
greatest weaknesses in @command{awk}, which is that it is very poor at
|
|
manipulating single characters. Repeated calls to @code{substr} are
|
|
necessary for accessing individual characters
|
|
(@pxref{String Functions}).@footnote{This
|
|
function was written before @command{gawk} acquired the ability to
|
|
split strings into single characters using @code{""} as the separator.
|
|
We have left it alone, since using @code{substr} is more portable.}
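
To make the point concrete, here is a minimal sketch of walking through a
grouped option string one character at a time with @code{substr}. The
function name @code{split_flags} is made up for this illustration:

```shell
#!/bin/sh
# Sketch: step through the characters of one argument with substr().
# split_flags is an illustrative name, not part of getopt.awk.
split_flags() {
    awk -v arg="$1" 'BEGIN {
        # start at position 2 to skip the leading "-"
        for (i = 2; i <= length(arg); i++)
            printf "%s%s", substr(arg, i, 1), (i < length(arg) ? " " : "\n")
    }'
}

split_flags -abx    # prints "a b x"
```

The @code{getopt} function uses the same idea, except that it keeps its
position in a global variable so that it can return one letter per call.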
|
|
|
|
The discussion that follows walks through the code a bit at a time:
|
|
|
|
@cindex @code{getopt} user-defined function
|
|
@example
|
|
@c file eg/lib/getopt.awk
|
|
# getopt.awk --- do C library getopt(3) function in awk
|
|
@c endfile
|
|
@ignore
|
|
@c file eg/lib/getopt.awk
|
|
#
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
#
|
|
# Initial version: March, 1991
|
|
# Revised: May, 1993
|
|
|
|
@c endfile
|
|
@end ignore
|
|
@c file eg/lib/getopt.awk
|
|
# External variables:
|
|
# Optind -- index in ARGV of first nonoption argument
|
|
# Optarg -- string value of argument to current option
|
|
# Opterr -- if nonzero, print our own diagnostic
|
|
# Optopt -- current option letter
|
|
|
|
# Returns:
|
|
# -1 at end of options
|
|
# ? for unrecognized option
|
|
# <c> a character representing the current option
|
|
|
|
# Private Data:
|
|
# _opti -- index in multi-flag option, e.g., -abc
|
|
@c endfile
|
|
@end example
|
|
|
|
The function starts out with
|
|
a list of the global variables it uses,
|
|
what the return values are, what they mean, and any global variables that
|
|
are ``private'' to this library function. Such documentation is essential
|
|
for any program, and particularly for library functions.
|
|
|
|
The @code{getopt} function first checks that it was indeed called with a string of options
|
|
(the @code{options} parameter). If @code{options} has a zero length,
|
|
@code{getopt} immediately returns @minus{}1:
|
|
|
|
@cindex @code{getopt} user-defined function
|
|
@example
|
|
@c file eg/lib/getopt.awk
|
|
function getopt(argc, argv, options,    thisopt, i)
@{
    if (length(options) == 0)    # no options given
        return -1

@group
    if (argv[Optind] == "--") @{  # all done
        Optind++
        _opti = 0
        return -1
@end group
    @} else if (argv[Optind] !~ /^-[^: \t\n\f\r\v\b]/) @{
        _opti = 0
        return -1
    @}
|
|
@c endfile
|
|
@end example
|
|
|
|
The next thing to check for is the end of the options. A @option{--}
|
|
ends the command-line options, as does any command-line argument that
|
|
does not begin with a @samp{-}. @code{Optind} is used to step through
|
|
the array of command-line arguments; it retains its value across calls
|
|
to @code{getopt}, because it is a global variable.
|
|
|
|
The regular expression that is used, @code{@w{/^-[^: \t\n\f\r\v\b]/}}, is
|
|
perhaps a bit of overkill; it checks for a @samp{-} followed by anything
|
|
that is not whitespace and not a colon.
|
|
If the current command-line argument does not match this pattern,
|
|
it is not an option, and it ends option processing:
|
|
|
|
@example
|
|
@c file eg/lib/getopt.awk
|
|
    if (_opti == 0)
        _opti = 2
    thisopt = substr(argv[Optind], _opti, 1)
    Optopt = thisopt
    i = index(options, thisopt)
    if (i == 0) @{
        if (Opterr)
            printf("%c -- invalid option\n",
                                  thisopt) > "/dev/stderr"
        if (_opti >= length(argv[Optind])) @{
            Optind++
            _opti = 0
        @} else
            _opti++
        return "?"
    @}
|
|
@c endfile
|
|
@end example
|
|
|
|
The @code{_opti} variable tracks the position in the current command-line
|
|
argument (@code{argv[Optind]}). If multiple options are
|
|
grouped together with one @samp{-} (e.g., @option{-abx}), it is necessary
|
|
to return them to the user one at a time.
|
|
|
|
If @code{_opti} is equal to zero, it is set to two, which is the index in
|
|
the string of the next character to look at (we skip the @samp{-}, which
|
|
is at position one). The variable @code{thisopt} holds the character,
|
|
obtained with @code{substr}. It is saved in @code{Optopt} for the main
|
|
program to use.
|
|
|
|
If @code{thisopt} is not in the @code{options} string, then it is an
|
|
invalid option. If @code{Opterr} is nonzero, @code{getopt} prints an error
|
|
message on the standard error that is similar to the message from the C
|
|
version of @code{getopt}.
|
|
|
|
Because the option is invalid, it is necessary to skip it and move on to the
|
|
next option character. If @code{_opti} is greater than or equal to the
|
|
length of the current command-line argument, it is necessary to move on
|
|
to the next argument, so @code{Optind} is incremented and @code{_opti} is reset
|
|
to zero. Otherwise, @code{Optind} is left alone and @code{_opti} is merely
|
|
incremented.
|
|
|
|
In any case, because the option is invalid, @code{getopt} returns @samp{?}.
|
|
The main program can examine @code{Optopt} if it needs to know what the
|
|
invalid option letter actually is. Continuing on:
|
|
|
|
@example
|
|
@c file eg/lib/getopt.awk
|
|
    if (substr(options, i + 1, 1) == ":") @{
        # get option argument
        if (length(substr(argv[Optind], _opti + 1)) > 0)
            Optarg = substr(argv[Optind], _opti + 1)
        else
            Optarg = argv[++Optind]
        _opti = 0
    @} else
        Optarg = ""
|
|
@c endfile
|
|
@end example
|
|
|
|
If the option requires an argument, the option letter is followed by a colon
|
|
in the @code{options} string. If there are remaining characters in the
|
|
current command-line argument (@code{argv[Optind]}), then the rest of that
|
|
string is assigned to @code{Optarg}. Otherwise, the next command-line
|
|
argument is used (@samp{-xFOO} versus @samp{@w{-x FOO}}). In either case,
|
|
@code{_opti} is reset to zero, because there are no more characters left to
|
|
examine in the current command-line argument. Continuing:
|
|
|
|
@example
|
|
@c file eg/lib/getopt.awk
|
|
    if (_opti == 0 || _opti >= length(argv[Optind])) @{
        Optind++
        _opti = 0
    @} else
        _opti++
    return thisopt
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
Finally, if @code{_opti} is either zero or greater than or equal to the length of the
|
|
current command-line argument, it means this element in @code{argv} is
|
|
through being processed, so @code{Optind} is incremented to point to the
|
|
next element in @code{argv}. If neither condition is true, then only
|
|
@code{_opti} is incremented, so that the next option letter can be processed
|
|
on the next call to @code{getopt}.
|
|
|
|
The @code{BEGIN} rule initializes both @code{Opterr} and @code{Optind} to one.
|
|
@code{Opterr} is set to one, since the default behavior is for @code{getopt}
|
|
to print a diagnostic message upon seeing an invalid option. @code{Optind}
|
|
is set to one, since there's no reason to look at the program name, which is
|
|
in @code{ARGV[0]}:
|
|
|
|
@example
|
|
@c file eg/lib/getopt.awk
|
|
BEGIN @{
    Opterr = 1    # default is to diagnose
    Optind = 1    # skip ARGV[0]

    # test program
    if (_getopt_test) @{
        while ((_go_c = getopt(ARGC, ARGV, "ab:cd")) != -1)
            printf("c = <%c>, optarg = <%s>\n",
                                       _go_c, Optarg)
        printf("non-option arguments:\n")
        for (; Optind < ARGC; Optind++)
            printf("\tARGV[%d] = <%s>\n",
                                    Optind, ARGV[Optind])
    @}
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
The rest of the @code{BEGIN} rule is a simple test program. Here is the
|
|
result of two sample runs of the test program:
|
|
|
|
@example
|
|
$ awk -f getopt.awk -v _getopt_test=1 -- -a -cbARG bax -x
|
|
@print{} c = <a>, optarg = <>
|
|
@print{} c = <c>, optarg = <>
|
|
@print{} c = <b>, optarg = <ARG>
|
|
@print{} non-option arguments:
|
|
@print{} ARGV[3] = <bax>
|
|
@print{} ARGV[4] = <-x>
|
|
|
|
$ awk -f getopt.awk -v _getopt_test=1 -- -a -x -- xyz abc
|
|
@print{} c = <a>, optarg = <>
|
|
@error{} x -- invalid option
|
|
@print{} c = <?>, optarg = <>
|
|
@print{} non-option arguments:
|
|
@print{} ARGV[4] = <xyz>
|
|
@print{} ARGV[5] = <abc>
|
|
@end example
|
|
|
|
In both runs,
|
|
the first @option{--} terminates the arguments to @command{awk}, so that it does
|
|
not try to interpret the @option{-a}, etc., as its own options.
|
|
Several of the sample programs presented in
|
|
@ref{Sample Programs},
|
|
use @code{getopt} to process their arguments.
|
|
@c ENDOFRANGE libfclo
|
|
@c ENDOFRANGE flibclo
|
|
@c ENDOFRANGE clop
|
|
@c ENDOFRANGE oclp
|
|
|
|
@node Passwd Functions
|
|
@section Reading the User Database
|
|
|
|
@c STARTOFRANGE libfudata
|
|
@cindex libraries of @command{awk} functions, user database, reading
|
|
@c STARTOFRANGE flibudata
|
|
@cindex functions, library, user database, reading
|
|
@c last comma is part of primary
|
|
@c STARTOFRANGE udatar
|
|
@cindex user database, reading
|
|
@c last comma is part of secondary
|
|
@c STARTOFRANGE dataur
|
|
@cindex database, users, reading
|
|
@cindex @code{PROCINFO} array
|
|
The @code{PROCINFO} array
|
|
(@pxref{Built-in Variables})
|
|
provides access to the current user's real and effective user and group ID
|
|
numbers, and if available, the user's supplementary group set.
|
|
However, because these are numbers, they do not provide very useful
|
|
information to the average user. There needs to be some way to find the
|
|
user information associated with the user and group ID numbers. This
|
|
@value{SECTION} presents a suite of functions for retrieving information from the
|
|
user database. @xref{Group Functions},
|
|
for a similar suite that retrieves information from the group database.
|
|
|
|
@cindex @code{getpwent} function (C library)
|
|
@cindex @code{getpwent} user-defined function
|
|
@cindex users, information about, retrieving
|
|
@cindex login information
|
|
@cindex account information
|
|
@cindex password file
|
|
@cindex files, password
|
|
The POSIX standard does not define the file where user information is
|
|
kept. Instead, it provides the @code{<pwd.h>} header file
|
|
and several C language subroutines for obtaining user information.
|
|
The primary function is @code{getpwent}, for ``get password entry.''
|
|
The ``password'' comes from the original user database file,
|
|
@file{/etc/passwd}, which stores user information, along with the
|
|
encrypted passwords (hence the name).
|
|
|
|
@cindex @command{pwcat} program
|
|
While an @command{awk} program could simply read @file{/etc/passwd}
|
|
directly, this file may not contain complete information about the
|
|
system's set of users.@footnote{It is often the case that password
|
|
information is stored in a network database.} To be sure you are able to
|
|
produce a readable and complete version of the user database, it is necessary
|
|
to write a small C program that calls @code{getpwent}. @code{getpwent}
|
|
is defined as returning a pointer to a @code{struct passwd}. Each time it
|
|
is called, it returns the next entry in the database. When there are
|
|
no more entries, it returns @code{NULL}, the null pointer. When this
|
|
happens, the C program should call @code{endpwent} to close the database.
|
|
Following is @command{pwcat}, a C program that ``cats'' the password database:
|
|
|
|
@c Use old style function header for portability to old systems (SunOS, HP/UX).
|
|
|
|
@example
|
|
@c file eg/lib/pwcat.c
|
|
/*
|
|
* pwcat.c
|
|
*
|
|
* Generate a printable version of the password database
|
|
*/
|
|
@c endfile
|
|
@ignore
|
|
@c file eg/lib/pwcat.c
|
|
/*
|
|
* Arnold Robbins, arnold@@gnu.org, May 1993
|
|
* Public Domain
|
|
*/
|
|
|
|
#if HAVE_CONFIG_H
|
|
#include <config.h>
|
|
#endif
|
|
|
|
@c endfile
|
|
@end ignore
|
|
@c file eg/lib/pwcat.c
|
|
#include <stdio.h>
|
|
#include <pwd.h>
|
|
|
|
@c endfile
|
|
@ignore
|
|
@c file eg/lib/pwcat.c
|
|
#if defined (STDC_HEADERS)
|
|
#include <stdlib.h>
|
|
#endif
|
|
|
|
@c endfile
|
|
@end ignore
|
|
@c file eg/lib/pwcat.c
|
|
int
|
|
main(argc, argv)
|
|
int argc;
|
|
char **argv;
|
|
@{
|
|
struct passwd *p;
|
|
|
|
while ((p = getpwent()) != NULL)
|
|
printf("%s:%s:%ld:%ld:%s:%s:%s\n",
|
|
p->pw_name, p->pw_passwd, (long) p->pw_uid,
|
|
(long) p->pw_gid, p->pw_gecos, p->pw_dir, p->pw_shell);
|
|
|
|
endpwent();
|
|
return 0;
|
|
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
If you don't understand C, don't worry about it.
|
|
The output from @command{pwcat} is the user database, in the traditional
|
|
@file{/etc/passwd} format of colon-separated fields. The fields are:
|
|
|
|
@ignore
|
|
@table @asis
|
|
@item Login name
|
|
The user's login name.
|
|
|
|
@item Encrypted password
|
|
The user's encrypted password. This may not be available on some systems.
|
|
|
|
@item User-ID
|
|
The user's numeric user ID number.
|
|
(On some systems it's a C @code{long}, and not an @code{int}. Thus
|
|
we cast it to @code{long} for all cases.)
|
|
|
|
@item Group-ID
|
|
The user's numeric group ID number.
|
|
(Similar comments about @code{long} vs.@: @code{int} apply here.)
|
|
|
|
@item Full name
|
|
The user's full name, and perhaps other information associated with the
|
|
user.
|
|
|
|
@item Home directory
|
|
The user's login (or ``home'') directory (familiar to shell programmers as
|
|
@code{$HOME}).
|
|
|
|
@item Login shell
|
|
The program that is run when the user logs in. This is usually a
|
|
shell, such as @command{bash}.
|
|
@end table
|
|
@end ignore
|
|
|
|
@multitable {Encrypted password} {1234567890123456789012345678901234567890123456}
|
|
@item Login name @tab The user's login name.
|
|
|
|
@item Encrypted password @tab The user's encrypted password. This may not be available on some systems.
|
|
|
|
@item User-ID @tab The user's numeric user ID number.
|
|
|
|
@item Group-ID @tab The user's numeric group ID number.
|
|
|
|
@item Full name @tab The user's full name, and perhaps other information associated with the
|
|
user.
|
|
|
|
@item Home directory @tab The user's login (or ``home'') directory (familiar to shell programmers as
|
|
@code{$HOME}).
|
|
|
|
@item Login shell @tab The program that is run when the user logs in. This is usually a
|
|
shell, such as @command{bash}.
|
|
@end multitable
|
|
|
|
A few lines representative of @command{pwcat}'s output are as follows:
|
|
|
|
@cindex Jacobs, Andrew
|
|
@cindex Robbins, Arnold
|
|
@cindex Robbins, Miriam
|
|
@example
|
|
$ pwcat
|
|
@print{} root:3Ov02d5VaUPB6:0:1:Operator:/:/bin/sh
|
|
@print{} nobody:*:65534:65534::/:
|
|
@print{} daemon:*:1:1::/:
|
|
@print{} sys:*:2:2::/:/bin/csh
|
|
@print{} bin:*:3:3::/bin:
|
|
@print{} arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh
|
|
@print{} miriam:yxaay:112:10:Miriam Robbins:/home/miriam:/bin/sh
|
|
@print{} andy:abcca2:113:10:Andy Jacobs:/home/andy:/bin/sh
|
|
@dots{}
|
|
@end example
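
As a quick illustration of this format, the following sketch splits one
of the sample records shown above into its fields with @code{split}.
The record is hard-coded purely for demonstration, and the array name
@code{f} is an arbitrary choice:

@example
BEGIN @{
    rec = "arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh"
    n = split(rec, f, ":")      # n is 7 for a well-formed record
    printf "login=%s uid=%d home=%s shell=%s\n", f[1], f[3], f[6], f[7]
@}
@end example

@noindent
This prints @samp{login=arnold uid=2076 home=/home/arnold shell=/bin/sh}.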
|
|
|
|
With that introduction, following is a group of functions for getting user
|
|
information. There are several functions here, corresponding to the C
|
|
functions of the same names:
|
|
|
|
@c Exercise: simplify all these functions that return values.
|
|
@c Answer: return foo[key] returns "" if key not there, no need to check with `in'.
|
|
|
|
@cindex @code{_pw_init} user-defined function
|
|
@example
|
|
@c file eg/lib/passwdawk.in
|
|
# passwd.awk --- access password file information
|
|
@c endfile
|
|
@ignore
|
|
@c file eg/lib/passwdawk.in
|
|
#
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# May 1993
|
|
# Revised October 2000
|
|
|
|
@c endfile
|
|
@end ignore
|
|
@c file eg/lib/passwdawk.in
|
|
BEGIN @{
|
|
# tailor this to suit your system
|
|
_pw_awklib = "/usr/local/libexec/awk/"
|
|
@}
|
|
|
|
function _pw_init( oldfs, oldrs, olddol0, pwcat, using_fw)
|
|
@{
|
|
if (_pw_inited)
|
|
return
|
|
|
|
oldfs = FS
|
|
oldrs = RS
|
|
olddol0 = $0
|
|
using_fw = (PROCINFO["FS"] == "FIELDWIDTHS")
|
|
FS = ":"
|
|
RS = "\n"
|
|
|
|
pwcat = _pw_awklib "pwcat"
|
|
while ((pwcat | getline) > 0) @{
|
|
_pw_byname[$1] = $0
|
|
_pw_byuid[$3] = $0
|
|
_pw_bycount[++_pw_total] = $0
|
|
@}
|
|
close(pwcat)
|
|
_pw_count = 0
|
|
_pw_inited = 1
|
|
FS = oldfs
|
|
if (using_fw)
|
|
FIELDWIDTHS = FIELDWIDTHS
|
|
RS = oldrs
|
|
$0 = olddol0
|
|
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
@cindex @code{BEGIN} pattern, @code{pwcat} program
|
|
The @code{BEGIN} rule sets a private variable to the directory where
|
|
@command{pwcat} is stored. Because it is used to help out an @command{awk} library
|
|
routine, we have chosen to put it in @file{/usr/local/libexec/awk};
|
|
however, you might want it to be in a different directory on your system.
|
|
|
|
The function @code{_pw_init} keeps three copies of the user information
|
|
in three associative arrays. The arrays are indexed by username
|
|
(@code{_pw_byname}), by user ID number (@code{_pw_byuid}), and by order of
|
|
occurrence (@code{_pw_bycount}).
|
|
The variable @code{_pw_inited} is used for efficiency; @code{_pw_init}
|
|
needs only to be called once.
|
|
|
|
@cindex @code{getline} command, @code{_pw_init} function
|
|
Because this function uses @code{getline} to read information from
|
|
@command{pwcat}, it first saves the values of @code{FS}, @code{RS}, and @code{$0}.
|
|
It notes in the variable @code{using_fw} whether field splitting
|
|
with @code{FIELDWIDTHS} is in effect or not.
|
|
Doing so is necessary, since these functions could be called
|
|
from anywhere within a user's program, and the user may have his
|
|
or her
|
|
own way of splitting records and fields.
|
|
|
|
The @code{using_fw} variable records the result of testing @code{PROCINFO["FS"]}, which
|
|
is @code{"FIELDWIDTHS"} if field splitting is being done with
|
|
@code{FIELDWIDTHS}. This makes it possible to restore the correct
|
|
field-splitting mechanism later. The test can only be true for
|
|
@command{gawk}. It is false if using @code{FS} or on some other
|
|
@command{awk} implementation.
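
The save-and-restore idiom can be tried in isolation. The following is
a minimal sketch and is not part of the library: the function name
@code{count_colon_fields} and its argument are hypothetical, and it reads
a colon-separated file instead of a pipeline:

@example
# count_colon_fields --- hypothetical helper: count the records in
# FILE that have seven colon-separated fields, without disturbing
# the caller's FS, RS, or $0

function count_colon_fields(file,    oldfs, oldrs, olddol0, using_fw, n)
@{
    oldfs = FS
    oldrs = RS
    olddol0 = $0
    using_fw = (PROCINFO["FS"] == "FIELDWIDTHS")   # true only in gawk
    FS = ":"
    RS = "\n"

    while ((getline < file) > 0)
        if (NF == 7)
            n++
    close(file)

    FS = oldfs
    if (using_fw)
        FIELDWIDTHS = FIELDWIDTHS   # reinstate fixed-width splitting
    RS = oldrs
    $0 = olddol0
    return n + 0
@}
@end example

@noindent
A caller that sets its own @code{FS} sees the same @code{FS}, @code{RS},
and @code{$0} after the call as before it.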
|
|
|
|
The main part of the function uses a loop to read database lines, split
|
|
the line into fields, and then store the line into each array as necessary.
|
|
When the loop is done, @code{@w{_pw_init}} cleans up by closing the pipeline,
|
|
setting @code{@w{_pw_inited}} to one, and restoring @code{FS} (and @code{FIELDWIDTHS}
|
|
if necessary), @code{RS}, and @code{$0}.
|
|
The use of @code{@w{_pw_count}} is explained shortly.
|
|
|
|
@c NEXT ED: All of these functions don't need the ... in ... test. Just
|
|
@c return the array element, which will be "" if not already there. Duh.
|
|
@cindex @code{getpwnam} function (C library)
|
|
The @code{getpwnam} function takes a username as a string argument. If that
|
|
user is in the database, it returns the appropriate line. Otherwise, it
|
|
returns the null string:
|
|
|
|
@cindex @code{getpwnam} user-defined function
|
|
@example
|
|
@group
|
|
@c file eg/lib/passwdawk.in
|
|
function getpwnam(name)
|
|
@{
|
|
_pw_init()
|
|
if (name in _pw_byname)
|
|
return _pw_byname[name]
|
|
return ""
|
|
@}
|
|
@c endfile
|
|
@end group
|
|
@end example
|
|
|
|
@cindex @code{getpwuid} function (C library)
|
|
Similarly,
|
|
the @code{getpwuid} function takes a user ID number argument. If that
|
|
user number is in the database, it returns the appropriate line. Otherwise, it
|
|
returns the null string:
|
|
|
|
@cindex @code{getpwuid} user-defined function
|
|
@example
|
|
@c file eg/lib/passwdawk.in
|
|
function getpwuid(uid)
|
|
@{
|
|
_pw_init()
|
|
if (uid in _pw_byuid)
|
|
return _pw_byuid[uid]
|
|
return ""
|
|
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
@cindex @code{getpwent} function (C library)
|
|
The @code{getpwent} function simply steps through the database, one entry at
|
|
a time. It uses @code{_pw_count} to track its current position in the
|
|
@code{_pw_bycount} array:
|
|
|
|
@cindex @code{getpwent} user-defined function
|
|
@example
|
|
@c file eg/lib/passwdawk.in
|
|
function getpwent()
|
|
@{
|
|
_pw_init()
|
|
if (_pw_count < _pw_total)
|
|
return _pw_bycount[++_pw_count]
|
|
return ""
|
|
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
@cindex @code{endpwent} function (C library)
|
|
The @code{@w{endpwent}} function resets @code{@w{_pw_count}} to zero, so that
|
|
subsequent calls to @code{getpwent} start over again:
|
|
|
|
@cindex @code{endpwent} user-defined function
|
|
@example
|
|
@c file eg/lib/passwdawk.in
|
|
function endpwent()
|
|
@{
|
|
_pw_count = 0
|
|
@}
|
|
@c endfile
|
|
@end example
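
As an illustration of how a program might use these functions, the
following sketch lists every account's login name and shell, and then
looks up a single user. It assumes that the library file is loaded with
its own @option{-f} option ahead of this program, and that
@command{pwcat} is installed in the directory named by @code{_pw_awklib};
the login name @samp{arnold} is just a sample value:

@example
BEGIN @{
    while ((line = getpwent()) != "") @{
        split(line, f, ":")
        printf "%-12s %s\n", f[1], f[7]
    @}
    endpwent()      # rewind, so that later getpwent calls start over

    line = getpwnam("arnold")
    if (line == "")
        print "no such user: arnold" > "/dev/stderr"
    else @{
        split(line, f, ":")
        print "home directory:", f[6]
    @}
@}
@end example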
|
|
|
|
A conscious design decision in this suite is that each subroutine calls
|
|
@code{@w{_pw_init}} to initialize the database arrays. The overhead of running
|
|
a separate process to generate the user database, and the I/O to scan it,
|
|
are only incurred if the user's main program actually calls one of these
|
|
functions. If this library file is loaded along with a user's program, but
|
|
none of the routines are ever called, then there is no extra runtime overhead.
|
|
(The alternative is to move the body of @code{@w{_pw_init}} into a
|
|
@code{BEGIN} rule, which always runs @command{pwcat}. This simplifies the
|
|
code but runs an extra process that may never be needed.)
|
|
|
|
In turn, calling @code{_pw_init} is not too expensive, because the
|
|
@code{_pw_inited} variable keeps the program from reading the data more than
|
|
once. If you are worried about squeezing every last cycle out of your
|
|
@command{awk} program, the check of @code{_pw_inited} could be moved out of
|
|
@code{_pw_init} and duplicated in all the other functions. In practice,
|
|
this is not necessary, since most @command{awk} programs are I/O-bound, and it
|
|
clutters up the code.
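
For the curious, here is a sketch of what that variant might look like
for one function. The name @code{getpwnam_fast} is hypothetical, chosen
only to avoid clashing with the library's @code{getpwnam}:

@example
function getpwnam_fast(name)
@{
    if (! _pw_inited)       # skip the function call once data is loaded
        _pw_init()
    if (name in _pw_byname)
        return _pw_byname[name]
    return ""
@}
@end example

@noindent
The saving is one function call per lookup, which is rarely measurable
next to the cost of running @command{pwcat} in the first place.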
|
|
|
|
The @command{id} program in @ref{Id Program},
|
|
uses these functions.
|
|
@c ENDOFRANGE libfudata
|
|
@c ENDOFRANGE flibudata
|
|
@c ENDOFRANGE udatar
|
|
@c ENDOFRANGE dataur
|
|
|
|
@node Group Functions
|
|
@section Reading the Group Database
|
|
|
|
@c STARTOFRANGE libfgdata
|
|
@cindex libraries of @command{awk} functions, group database, reading
|
|
@c STARTOFRANGE flibgdata
|
|
@cindex functions, library, group database, reading
|
|
@c STARTOFRANGE gdatar
|
|
@cindex group database, reading
|
|
@c STARTOFRANGE datagr
|
|
@cindex database, group, reading
|
|
@cindex @code{PROCINFO} array
|
|
@cindex @code{getgrent} function (C library)
|
|
@cindex @code{getgrent} user-defined function
|
|
@c comma is part of primary
|
|
@cindex groups, information about
|
|
@cindex account information
|
|
@cindex group file
|
|
@cindex files, group
|
|
Much of the discussion presented in
|
|
@ref{Passwd Functions},
|
|
applies to the group database as well. Although there has traditionally
|
|
been a well-known file (@file{/etc/group}) in a well-known format, the POSIX
|
|
standard only provides a set of C library routines
|
|
(@code{<grp.h>} and @code{getgrent})
|
|
for accessing the information.
|
|
Even though this file may exist, it likely does not have
|
|
complete information. Therefore, as with the user database, it is necessary
|
|
to have a small C program that generates the group database as its output.
|
|
|
|
@cindex @command{grcat} program
|
|
@command{grcat}, a C program that ``cats'' the group database,
|
|
is as follows:
|
|
|
|
@example
|
|
@c file eg/lib/grcat.c
|
|
/*
|
|
* grcat.c
|
|
*
|
|
* Generate a printable version of the group database
|
|
*/
|
|
@c endfile
|
|
@ignore
|
|
@c file eg/lib/grcat.c
|
|
/*
|
|
* Arnold Robbins, arnold@@gnu.org, May 1993
|
|
* Public Domain
|
|
*/
|
|
|
|
/* For OS/2, do nothing. */
|
|
#if HAVE_CONFIG_H
|
|
#include <config.h>
|
|
#endif
|
|
|
|
#if defined (STDC_HEADERS)
|
|
#include <stdlib.h>
|
|
#endif
|
|
|
|
#ifndef HAVE_GETGRENT
|
|
int main() { return 0; }
|
|
#else
|
|
@c endfile
|
|
@end ignore
|
|
@c file eg/lib/grcat.c
|
|
#include <stdio.h>
|
|
#include <grp.h>
|
|
|
|
int
|
|
main(argc, argv)
|
|
int argc;
|
|
char **argv;
|
|
@{
|
|
struct group *g;
|
|
int i;
|
|
|
|
while ((g = getgrent()) != NULL) @{
|
|
printf("%s:%s:%ld:", g->gr_name, g->gr_passwd,
|
|
(long) g->gr_gid);
|
|
for (i = 0; g->gr_mem[i] != NULL; i++) @{
|
|
printf("%s", g->gr_mem[i]);
|
|
@group
|
|
if (g->gr_mem[i+1] != NULL)
|
|
putchar(',');
|
|
@}
|
|
@end group
|
|
putchar('\n');
|
|
@}
|
|
endgrent();
|
|
return 0;
|
|
@}
|
|
@c endfile
|
|
@end example
|
|
@ignore
|
|
@c file eg/lib/grcat.c
|
|
#endif /* HAVE_GETGRENT */
|
|
@c endfile
|
|
@end ignore
|
|
|
|
Each line in the group database represents one group. The fields are
|
|
separated with colons and represent the following information:
|
|
|
|
@ignore
|
|
@table @asis
|
|
@item Group Name
|
|
The name of the group.
|
|
|
|
@item Group Password
|
|
The encrypted group password. In practice, this field is never used. It is
|
|
usually empty or set to @samp{*}.
|
|
|
|
@item Group ID Number
|
|
The numeric group ID number. This number is unique within the file.
|
|
(On some systems it's a C @code{long}, and not an @code{int}. Thus
|
|
we cast it to @code{long} for all cases.)
|
|
|
|
@item Group Member List
|
|
A comma-separated list of usernames. These users are members of the group.
|
|
Modern Unix systems allow users to be members of several groups
|
|
simultaneously. If your system does, then there are elements
|
|
@code{"group1"} through @code{"group@var{N}"} in @code{PROCINFO}
|
|
for those group ID numbers.
|
|
(Note that @code{PROCINFO} is a @command{gawk} extension;
|
|
@pxref{Built-in Variables}.)
|
|
@end table
|
|
@end ignore
|
|
|
|
@multitable {Encrypted password} {1234567890123456789012345678901234567890123456}
|
|
@item Group name @tab The group's name.
|
|
|
|
@item Group password @tab The group's encrypted password. In practice, this field is never used;
|
|
it is usually empty or set to @samp{*}.
|
|
|
|
@item Group-ID @tab
|
|
The group's numeric group ID number; this number should be unique within the file.
|
|
|
|
@item Group member list @tab
|
|
A comma-separated list of usernames. These users are members of the group.
|
|
Modern Unix systems allow users to be members of several groups
|
|
simultaneously. If your system does, then there are elements
|
|
@code{"group1"} through @code{"group@var{N}"} in @code{PROCINFO}
|
|
for those group ID numbers.
|
|
(Note that @code{PROCINFO} is a @command{gawk} extension;
|
|
@pxref{Built-in Variables}.)
|
|
@end multitable
|
|
|
|
Here is what running @command{grcat} might produce:
|
|
|
|
@example
|
|
$ grcat
|
|
@print{} wheel:*:0:arnold
|
|
@print{} nogroup:*:65534:
|
|
@print{} daemon:*:1:
|
|
@print{} kmem:*:2:
|
|
@print{} staff:*:10:arnold,miriam,andy
|
|
@print{} other:*:20:
|
|
@dots{}
|
|
@end example
|
|
|
|
Here are the functions for obtaining information from the group database.
|
|
There are several, modeled after the C library functions of the same names:
|
|
|
|
@cindex @code{getline} command, @code{_gr_init} user-defined function
|
|
@cindex @code{_gr_init} user-defined function
|
|
@example
|
|
@c file eg/lib/groupawk.in
|
|
# group.awk --- functions for dealing with the group file
|
|
@c endfile
|
|
@ignore
|
|
@c file eg/lib/groupawk.in
|
|
#
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# May 1993
|
|
# Revised October 2000
|
|
|
|
@c endfile
|
|
@end ignore
|
|
@c line break on _gr_init for smallbook
|
|
@c file eg/lib/groupawk.in
|
|
BEGIN \
|
|
@{
|
|
# Change to suit your system
|
|
_gr_awklib = "/usr/local/libexec/awk/"
|
|
@}
|
|
|
|
function _gr_init( oldfs, oldrs, olddol0, grcat,
|
|
using_fw, n, a, i)
|
|
@{
|
|
if (_gr_inited)
|
|
return
|
|
|
|
oldfs = FS
|
|
oldrs = RS
|
|
olddol0 = $0
|
|
using_fw = (PROCINFO["FS"] == "FIELDWIDTHS")
|
|
FS = ":"
|
|
RS = "\n"
|
|
|
|
grcat = _gr_awklib "grcat"
|
|
while ((grcat | getline) > 0) @{
|
|
if ($1 in _gr_byname)
|
|
_gr_byname[$1] = _gr_byname[$1] "," $4
|
|
else
|
|
_gr_byname[$1] = $0
|
|
if ($3 in _gr_bygid)
|
|
_gr_bygid[$3] = _gr_bygid[$3] "," $4
|
|
else
|
|
_gr_bygid[$3] = $0
|
|
|
|
n = split($4, a, "[ \t]*,[ \t]*")
|
|
for (i = 1; i <= n; i++)
|
|
if (a[i] in _gr_groupsbyuser)
|
|
_gr_groupsbyuser[a[i]] = \
|
|
_gr_groupsbyuser[a[i]] " " $1
|
|
else
|
|
_gr_groupsbyuser[a[i]] = $1
|
|
|
|
_gr_bycount[++_gr_count] = $0
|
|
@}
|
|
close(grcat)
|
|
_gr_count = 0
|
|
_gr_inited++
|
|
FS = oldfs
|
|
if (using_fw)
|
|
FIELDWIDTHS = FIELDWIDTHS
|
|
RS = oldrs
|
|
$0 = olddol0
|
|
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
The @code{BEGIN} rule sets a private variable to the directory where
|
|
@command{grcat} is stored. Because it is used to help out an @command{awk} library
|
|
routine, we have chosen to put it in @file{/usr/local/libexec/awk}. You might
|
|
want it to be in a different directory on your system.
|
|
|
|
These routines follow the same general outline as the user database routines
|
|
(@pxref{Passwd Functions}).
|
|
The @code{@w{_gr_inited}} variable is used to
|
|
ensure that the database is scanned no more than once.
|
|
The @code{@w{_gr_init}} function first saves @code{FS}, @code{RS}, and
@code{$0}, notes whether @code{FIELDWIDTHS} is in effect, and then sets
@code{FS} and @code{RS} to the correct values for scanning the group information.
|
|
|
|
The group information is stored in several associative arrays.
|
|
The arrays are indexed by group name (@code{@w{_gr_byname}}), by group ID number
|
|
(@code{@w{_gr_bygid}}), and by position in the database (@code{@w{_gr_bycount}}).
|
|
There is an additional array indexed by username (@code{@w{_gr_groupsbyuser}}),
|
|
which is a space-separated list of groups to which each user belongs.
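
To make the last of these arrays concrete, the following stand-alone
sketch builds a similar user-to-groups mapping from two hard-coded
records. The array name @code{byuser} is hypothetical and merely stands
in for @code{@w{_gr_groupsbyuser}}:

@example
BEGIN @{
    recs[1] = "wheel:*:0:arnold"
    recs[2] = "staff:*:10:arnold,miriam,andy"

    for (r = 1; r <= 2; r++) @{
        split(recs[r], f, ":")
        n = split(f[4], a, ",")     # member list is the fourth field
        for (i = 1; i <= n; i++)
            if (a[i] in byuser)
                byuser[a[i]] = byuser[a[i]] " " f[1]
            else
                byuser[a[i]] = f[1]
    @}
    print byuser["arnold"]
    print byuser["andy"]
@}
@end example

@noindent
This prints @samp{wheel staff} for @samp{arnold} and @samp{staff} for
@samp{andy}.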
|
|
|
|
Unlike the user database, it is possible to have multiple records in the
|
|
database for the same group. This is common when a group has a large number
|
|
of members. A pair of such entries might look like the following:
|
|
|
|
@example
|
|
tvpeople:*:101:johnny,jay,arsenio
|
|
tvpeople:*:101:david,conan,tom,joan
|
|
@end example
|
|
|
|
For this reason, @code{_gr_init} looks to see if a group name or
group ID number has already been seen. If it has, then the usernames are
|
|
simply concatenated onto the previous list of users. (There is actually a
|
|
subtle problem with the code just presented. Suppose that
|
|
the first time there were no names. This code adds the names with
|
|
a leading comma. It also doesn't check that there is a @code{$4}.)
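
One way to address both problems is to route the concatenation through
a small helper. The following is only a sketch: the function name
@code{_gr_append} is hypothetical and is not part of the library as
presented here:

@example
# _gr_append --- hypothetical helper: add the member list NEW to the
# group record OLD, avoiding empty additions and doubled commas

function _gr_append(old, new)
@{
    if (new == "")          # no $4, or an empty member list
        return old
    if (old ~ /:$/)         # existing record has no members yet
        return old new
    return old "," new
@}

BEGIN @{
    print _gr_append("tvpeople:*:101:", "david,conan")
    print _gr_append("tvpeople:*:101:johnny", "david,conan")
    print _gr_append("tvpeople:*:101:johnny", "")
@}
@end example

@noindent
The three calls print @samp{tvpeople:*:101:david,conan},
@samp{tvpeople:*:101:johnny,david,conan}, and
@samp{tvpeople:*:101:johnny}, respectively. With such a helper, the
assignments in @code{_gr_init} could take the form
@code{_gr_byname[$1] = _gr_append(_gr_byname[$1], $4)}.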
|
|
|
|
Finally, @code{_gr_init} closes the pipeline to @command{grcat}, restores
|
|
@code{FS} (and @code{FIELDWIDTHS} if necessary), @code{RS}, and @code{$0},
|
|
initializes @code{_gr_count} to zero
|
|
(it is used later), and makes @code{_gr_inited} nonzero.
|
|
|
|
@cindex @code{getgrnam} function (C library)
|
|
The @code{getgrnam} function takes a group name as its argument, and if that
|
|
group exists, it is returned. Otherwise, @code{getgrnam} returns the null
|
|
string:
|
|
|
|
@cindex @code{getgrnam} user-defined function
|
|
@example
|
|
@c file eg/lib/groupawk.in
|
|
function getgrnam(group)
|
|
@{
|
|
_gr_init()
|
|
if (group in _gr_byname)
|
|
return _gr_byname[group]
|
|
return ""
|
|
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
@cindex @code{getgrgid} function (C library)
|
|
The @code{getgrgid} function is similar; it takes a numeric group ID and
|
|
looks up the information associated with that group ID:
|
|
|
|
@cindex @code{getgrgid} user-defined function
|
|
@example
|
|
@c file eg/lib/groupawk.in
|
|
function getgrgid(gid)
|
|
@{
|
|
_gr_init()
|
|
if (gid in _gr_bygid)
|
|
return _gr_bygid[gid]
|
|
return ""
|
|
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
@cindex @code{getgruser} function (C library)
|
|
The @code{getgruser} function does not have a C counterpart. It takes a
|
|
username and returns the list of groups that have the user as a member:
|
|
|
|
@cindex @code{getgruser} function, user-defined
|
|
@example
|
|
@c file eg/lib/groupawk.in
|
|
function getgruser(user)
|
|
@{
|
|
_gr_init()
|
|
if (user in _gr_groupsbyuser)
|
|
return _gr_groupsbyuser[user]
|
|
return ""
|
|
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
@cindex @code{getgrent} function (C library)
|
|
The @code{getgrent} function steps through the database one entry at a time.
|
|
It uses @code{_gr_count} to track its position in the list:
|
|
|
|
@cindex @code{getgrent} user-defined function
|
|
@example
|
|
@c file eg/lib/groupawk.in
|
|
function getgrent()
|
|
@{
|
|
_gr_init()
|
|
if (++_gr_count in _gr_bycount)
|
|
return _gr_bycount[_gr_count]
|
|
return ""
|
|
@}
|
|
@c endfile
|
|
@end example
|
|
@c ENDOFRANGE clibf
|
|
|
|
@cindex @code{endgrent} function (C library)
|
|
The @code{endgrent} function resets @code{_gr_count} to zero so that @code{getgrent} can
|
|
start over again:
|
|
|
|
@cindex @code{endgrent} user-defined function
|
|
@example
|
|
@c file eg/lib/groupawk.in
|
|
function endgrent()
|
|
@{
|
|
_gr_count = 0
|
|
@}
|
|
@c endfile
|
|
@end example
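
As an illustration of how a program might use these functions, the
following sketch prints each group's name and ID number, and then lists
the groups of one user. It assumes that the library file is loaded with
its own @option{-f} option ahead of this program, and that
@command{grcat} is installed in the directory named by @code{_gr_awklib};
the username @samp{arnold} is just a sample value:

@example
BEGIN @{
    while ((line = getgrent()) != "") @{
        split(line, f, ":")
        printf "%-12s %d\n", f[1], f[3]
    @}
    endgrent()      # rewind, so that later getgrent calls start over

    n = split(getgruser("arnold"), grps, " ")
    for (i = 1; i <= n; i++)
        print "arnold is in group", grps[i]
@}
@end example

@noindent
If the user belongs to no groups, @code{getgruser} returns the null
string, @code{split} returns zero, and the final loop prints nothing.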
|
|
|
|
As with the user database routines, each function calls @code{_gr_init} to
|
|
initialize the arrays. Doing so only incurs the extra overhead of running
|
|
@command{grcat} if these functions are used (as opposed to moving the body of
|
|
@code{_gr_init} into a @code{BEGIN} rule).
|
|
|
|
Most of the work is in scanning the database and building the various
|
|
associative arrays. The functions that the user calls are themselves very
|
|
simple, relying on @command{awk}'s associative arrays to do the work.
|
|
|
|
The @command{id} program in @ref{Id Program},
|
|
uses these functions.
|
|
@c ENDOFRANGE libfgdata
|
|
@c ENDOFRANGE flibgdata
|
|
@c ENDOFRANGE gdatar
|
|
@c ENDOFRANGE libf
|
|
@c ENDOFRANGE flib
|
|
@c ENDOFRANGE fudlib
|
|
@c ENDOFRANGE datagr
|
|
|
|
@node Sample Programs
|
|
@chapter Practical @command{awk} Programs
|
|
@c STARTOFRANGE awkpex
|
|
@cindex @command{awk} programs, examples of
|
|
|
|
@ref{Library Functions},
|
|
presents the idea that reading programs in a language contributes to
|
|
learning that language. This @value{CHAPTER} continues that theme,
|
|
presenting a potpourri of @command{awk} programs for your reading
|
|
enjoyment.
|
|
@ifnotinfo
|
|
There are three sections.
|
|
The first describes how to run the programs presented
|
|
in this @value{CHAPTER}.
|
|
|
|
The second presents @command{awk}
|
|
versions of several common POSIX utilities.
|
|
These are programs that you are hopefully already familiar with,
and whose problems are therefore well understood.
|
|
By reimplementing these programs in @command{awk},
|
|
you can focus on the @command{awk}-related aspects of solving
|
|
the programming problem.
|
|
|
|
The third is a grab bag of interesting programs.
|
|
These solve a number of different data-manipulation and management
|
|
problems. Many of the programs are short, which emphasizes @command{awk}'s
|
|
ability to do a lot in just a few lines of code.
|
|
@end ifnotinfo
|
|
|
|
Many of these programs use the library functions presented in
|
|
@ref{Library Functions}.
|
|
|
|
@menu
|
|
* Running Examples:: How to run these examples.
|
|
* Clones:: Clones of common utilities.
|
|
* Miscellaneous Programs:: Some interesting @command{awk} programs.
|
|
@end menu
|
|
|
|
@node Running Examples
|
|
@section Running the Example Programs
|
|
|
|
To run a given program, you would typically do something like this:
|
|
|
|
@example
|
|
awk -f @var{program} -- @var{options} @var{files}
|
|
@end example
|
|
|
|
@noindent
|
|
Here, @var{program} is the name of the @command{awk} program (such as
|
|
@file{cut.awk}), @var{options} are any command-line options for the
|
|
program that start with a @samp{-}, and @var{files} are the actual @value{DF}s.
|
|
|
|
If your system supports the @samp{#!} executable interpreter mechanism
|
|
(@pxref{Executable Scripts}),
|
|
you can instead run your program directly:
|
|
|
|
@example
|
|
cut.awk -c1-8 myfiles > results
|
|
@end example
|
|
|
|
If your @command{awk} is not @command{gawk}, you may instead need to use this:
|
|
|
|
@example
|
|
cut.awk -- -c1-8 myfiles > results
|
|
@end example
|
|
|
|
@node Clones
|
|
@section Reinventing Wheels for Fun and Profit
|
|
@c last comma is part of secondary
|
|
@c STARTOFRANGE posimawk
|
|
@cindex POSIX, programs, implementing in @command{awk}
|
|
|
|
This @value{SECTION} presents a number of POSIX utilities that are implemented in
|
|
@command{awk}. Reinventing these programs in @command{awk} is often enjoyable,
|
|
because the algorithms can be very clearly expressed, and the code is usually
|
|
very concise and simple. This is true because @command{awk} does so much for you.
|
|
|
|
It should be noted that these programs are not necessarily intended to
|
|
replace the installed versions on your system. Instead, their
|
|
purpose is to illustrate @command{awk} language programming for ``real world''
|
|
tasks.
|
|
|
|
The programs are presented in alphabetical order.
|
|
|
|
@menu
|
|
* Cut Program:: The @command{cut} utility.
|
|
* Egrep Program:: The @command{egrep} utility.
|
|
* Id Program:: The @command{id} utility.
|
|
* Split Program:: The @command{split} utility.
|
|
* Tee Program:: The @command{tee} utility.
|
|
* Uniq Program:: The @command{uniq} utility.
|
|
* Wc Program:: The @command{wc} utility.
|
|
@end menu
|
|
|
|
@node Cut Program
|
|
@subsection Cutting out Fields and Columns
|
|
|
|
@cindex @command{cut} utility
|
|
@c STARTOFRANGE cut
|
|
@cindex @command{cut} utility
|
|
@c STARTOFRANGE ficut
|
|
@cindex fields, cutting
|
|
@c STARTOFRANGE colcut
|
|
@cindex columns, cutting
|
|
The @command{cut} utility selects, or ``cuts,'' characters or fields
|
|
from its standard input and sends them to its standard output.
|
|
Fields are separated by tabs by default,
|
|
but you may supply a command-line option to change the field
|
|
@dfn{delimiter} (i.e., the field-separator character). @command{cut}'s
|
|
definition of fields is less general than @command{awk}'s.
|
|
|
|
A common use of @command{cut} might be to pull out just the login name of
|
|
logged-on users from the output of @command{who}. For example, the following
|
|
pipeline generates a sorted, unique list of the logged-on users:
|
|
|
|
@example
|
|
who | cut -c1-8 | sort | uniq
|
|
@end example
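
For comparison, the character-cutting step of this pipeline can be
written directly in @command{awk} with @code{substr}. This is just a
sketch of the equivalence, not the implementation developed in this
@value{SECTION}:

@example
who | awk '@{ print substr($0, 1, 8) @}' | sort | uniq
@end example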
|
|
|
|
The options for @command{cut} are:
|
|
|
|
@table @code
|
|
@item -c @var{list}
|
|
Use @var{list} as the list of characters to cut out. Items within the list
|
|
may be separated by commas, and ranges of characters can be separated with
|
|
dashes. The list @samp{1-8,15,22-35} specifies characters 1 through
|
|
8, 15, and 22 through 35.
|
|
|
|
@item -f @var{list}
|
|
Use @var{list} as the list of fields to cut out.
|
|
|
|
@item -d @var{delim}
|
|
Use @var{delim} as the field-separator character instead of the tab
|
|
character.
|
|
|
|
@item -s
|
|
Suppress printing of lines that do not contain the field delimiter.
|
|
@end table
|
|
|
|
The @command{awk} implementation of @command{cut} uses the @code{getopt} library
|
|
function (@pxref{Getopt Function})
|
|
and the @code{join} library function
|
|
(@pxref{Join Function}).
|
|
|
|
The program begins with a comment describing the options, the library
|
|
functions needed, and a @code{usage} function that prints out a usage
|
|
message and exits. @code{usage} is called if invalid arguments are
|
|
supplied:
|
|
|
|
@cindex @code{cut.awk} program
|
|
@example
|
|
@c file eg/prog/cut.awk
|
|
# cut.awk --- implement cut in awk
|
|
@c endfile
|
|
@ignore
|
|
@c file eg/prog/cut.awk
|
|
#
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# May 1993
|
|
|
|
@c endfile
|
|
@end ignore
|
|
@c file eg/prog/cut.awk
|
|
# Options:
|
|
# -f list Cut fields
|
|
# -d c Field delimiter character
|
|
# -c list Cut characters
|
|
#
|
|
# -s Suppress lines without the delimiter
|
|
#
|
|
# Requires getopt and join library functions
|
|
|
|
@group
|
|
function usage( e1, e2)
|
|
@{
|
|
e1 = "usage: cut [-f list] [-d c] [-s] [files...]"
|
|
e2 = "usage: cut [-c list] [files...]"
|
|
print e1 > "/dev/stderr"
|
|
print e2 > "/dev/stderr"
|
|
exit 1
|
|
@}
|
|
@end group
|
|
@c endfile
|
|
@end example
|
|
|
|
@noindent
|
|
The variables @code{e1} and @code{e2} are used so that the function
|
|
fits nicely on the
|
|
@ifnotinfo
|
|
page.
|
|
@end ifnotinfo
|
|
@ifnottex
|
|
screen.
|
|
@end ifnottex
|
|
|
|
@cindex @code{BEGIN} pattern, running @command{awk} programs and
|
|
@cindex @code{FS} variable, running @command{awk} programs and
|
|
Next comes a @code{BEGIN} rule that parses the command-line options.
|
|
It sets @code{FS} to a single TAB character, because that is @command{cut}'s
|
|
default field separator. The output field separator is also set to be the
|
|
same as the input field separator. Then @code{getopt} is used to step
|
|
through the command-line options. Exactly one of the variables
|
|
@code{by_fields} or @code{by_chars} is set to true, to indicate that
|
|
processing should be done by fields or by characters, respectively.
|
|
When cutting by characters, the output field separator is set to the null
|
|
string:
|
|
|
|
@example
|
|
@c file eg/prog/cut.awk
|
|
BEGIN \
|
|
@{
|
|
FS = "\t" # default
|
|
OFS = FS
|
|
while ((c = getopt(ARGC, ARGV, "sf:c:d:")) != -1) @{
|
|
if (c == "f") @{
|
|
by_fields = 1
|
|
fieldlist = Optarg
|
|
@} else if (c == "c") @{
|
|
by_chars = 1
|
|
fieldlist = Optarg
|
|
OFS = ""
|
|
@} else if (c == "d") @{
|
|
if (length(Optarg) > 1) @{
|
|
printf("Using first character of %s" \
|
|
" for delimiter\n", Optarg) > "/dev/stderr"
|
|
Optarg = substr(Optarg, 1, 1)
|
|
@}
|
|
FS = Optarg
|
|
OFS = FS
|
|
if (FS == " ") # defeat awk semantics
|
|
FS = "[ ]"
|
|
@} else if (c == "s")
|
|
suppress++
|
|
else
|
|
usage()
|
|
@}
|
|
|
|
for (i = 1; i < Optind; i++)
|
|
ARGV[i] = ""
|
|
@c endfile
|
|
@end example
|
|
|
|
@cindex field separators, spaces as
|
|
Special care is taken when the field delimiter is a space. Using
|
|
a single space (@code{@w{" "}}) for the value of @code{FS} is
|
|
incorrect---@command{awk} would separate fields with runs of spaces,
|
|
tabs, and/or newlines, and we want them to be separated with individual
|
|
spaces. Also, note that after @code{getopt} is through, we have to
|
|
clear out all the elements of @code{ARGV} from 1 up to (but not including) @code{Optind},
|
|
so that @command{awk} does not try to process the command-line options
|
|
as @value{FN}s.
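
The difference between the two separators is easy to see with
@code{split}, which treats a single space the same way that @code{FS}
does. The following sketch uses a hard-coded sample string and an
arbitrary array name, @code{parts}:

@example
BEGIN @{
    s = "a  b"                      # two spaces between a and b
    print split(s, parts, " ")      # runs of whitespace separate
    print split(s, parts, "[ ]")    # each single space separates
@}
@end example

@noindent
The first call yields 2 pieces (@samp{a} and @samp{b}). The second yields
3, because the two adjacent spaces enclose an empty field, which is what
@command{cut} expects.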
|
|
|
|
After dealing with the command-line options, the program verifies that the
|
|
options make sense. Only one or the other of @option{-c} and @option{-f}
|
|
should be used, and both require a field list. Then the program calls
|
|
either @code{set_fieldlist} or @code{set_charlist} to pull apart the
|
|
list of fields or characters:
|
|
|
|
@example
|
|
@c file eg/prog/cut.awk
|
|
if (by_fields && by_chars)
|
|
usage()
|
|
|
|
if (by_fields == 0 && by_chars == 0)
|
|
by_fields = 1 # default
|
|
|
|
if (fieldlist == "") @{
|
|
print "cut: needs list for -c or -f" > "/dev/stderr"
|
|
exit 1
|
|
@}
|
|
|
|
if (by_fields)
|
|
set_fieldlist()
|
|
else
|
|
set_charlist()
|
|
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
@code{set_fieldlist} is used to split the field list apart at the commas
|
|
and into an array. Then, for each element of the array, it looks to
|
|
see if it is actually a range, and if so, splits it apart. The range
|
|
is verified to make sure the first number is smaller than the second.
|
|
Each number in the list is added to the @code{flist} array, which
|
|
simply lists the fields that will be printed. Normal field splitting
is used; the program lets @command{awk} handle the job of splitting
each record into fields:
|
|
|
|
@example
|
|
@c file eg/prog/cut.awk
|
|
function set_fieldlist( n, m, i, j, k, f, g)
|
|
@{
|
|
n = split(fieldlist, f, ",")
|
|
j = 1 # index in flist
|
|
for (i = 1; i <= n; i++) @{
|
|
if (index(f[i], "-") != 0) @{ # a range
|
|
m = split(f[i], g, "-")
|
|
@group
|
|
if (m != 2 || g[1] >= g[2]) @{
|
|
printf("bad field list: %s\n",
|
|
f[i]) > "/dev/stderr"
|
|
exit 1
|
|
@}
|
|
@end group
|
|
for (k = g[1]; k <= g[2]; k++)
|
|
flist[j++] = k
|
|
@} else
|
|
flist[j++] = f[i]
|
|
@}
|
|
nfields = j - 1
|
|
@}
|
|
@c endfile
|
|
@end example
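
To see the expansion at work without the rest of the program, here is a
stand-alone sketch of the same idea. The function name
@code{expand_list} and the array @code{nums} are hypothetical, and the
error checking is omitted for brevity:

@example
# expand_list --- hypothetical stand-alone version of the expansion
# logic: fill OUT with the numbers named by LIST, return the count

function expand_list(list, out,    n, m, i, j, k, f, g)
@{
    n = split(list, f, ",")
    j = 0
    for (i = 1; i <= n; i++) @{
        m = split(f[i], g, "-")
        if (m == 2)                 # a range such as 3-5
            for (k = g[1] + 0; k <= g[2] + 0; k++)
                out[++j] = k
        else
            out[++j] = f[i] + 0
    @}
    return j
@}

BEGIN @{
    count = expand_list("1,3-5", nums)
    for (i = 1; i <= count; i++)
        printf "%d ", nums[i]
    print ""
@}
@end example

@noindent
For the list @samp{1,3-5}, this prints the numbers 1, 3, 4, and 5.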
|
|
|
|
The @code{set_charlist} function is more complicated than @code{set_fieldlist}.
|
|
The idea here is to use @command{gawk}'s @code{FIELDWIDTHS} variable
|
|
(@pxref{Constant Size}),
|
|
which describes constant-width input. When using a character list, that is
|
|
exactly what we have.
|
|
|
|
Setting up @code{FIELDWIDTHS} is more complicated than simply listing the
|
|
fields that need to be printed. We have to keep track of the fields to
|
|
print and also the intervening characters that have to be skipped.
|
|
For example, suppose you wanted characters 1 through 8, 15, and
|
|
22 through 35. You would use @samp{-c 1-8,15,22-35}. The necessary value
|
|
for @code{FIELDWIDTHS} is @code{@w{"8 6 1 6 14"}}. This yields five
|
|
fields, and the fields to print
|
|
are @code{$1}, @code{$3}, and @code{$5}.
|
|
The intermediate fields are @dfn{filler}: the characters that lie between
the pieces of desired data.
|
|
@code{flist} lists the fields to print, and @code{t} tracks the
|
|
complete field list, including filler fields:
|
|
|
|
@example
|
|
@c file eg/prog/cut.awk
|
|
function set_charlist(    field, i, j, f, g, n, m, t,
|
|
filler, last, len)
|
|
@{
|
|
field = 1 # count total fields
|
|
n = split(fieldlist, f, ",")
|
|
j = 1 # index in flist
|
|
for (i = 1; i <= n; i++) @{
|
|
if (index(f[i], "-") != 0) @{ # range
|
|
m = split(f[i], g, "-")
|
|
if (m != 2 || g[1] >= g[2]) @{
|
|
printf("bad character list: %s\n",
|
|
f[i]) > "/dev/stderr"
|
|
exit 1
|
|
@}
|
|
len = g[2] - g[1] + 1
|
|
if (g[1] > 1) # compute length of filler
|
|
filler = g[1] - last - 1
|
|
else
|
|
filler = 0
|
|
@group
|
|
if (filler)
|
|
t[field++] = filler
|
|
@end group
|
|
t[field++] = len # length of field
|
|
last = g[2]
|
|
flist[j++] = field - 1
|
|
@} else @{
|
|
if (f[i] > 1)
|
|
filler = f[i] - last - 1
|
|
else
|
|
filler = 0
|
|
if (filler)
|
|
t[field++] = filler
|
|
t[field++] = 1
|
|
last = f[i]
|
|
flist[j++] = field - 1
|
|
@}
|
|
@}
|
|
FIELDWIDTHS = join(t, 1, field - 1)
|
|
nfields = j - 1
|
|
@}
|
|
@c endfile
|
|
@end example
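
To see the effect of such a @code{FIELDWIDTHS} value in isolation, the
following sketch sets it by hand instead of computing it. The input line
(the lowercase alphabet followed by the ten digits) is only an assumption
chosen so that character positions are easy to check:

@example
$ echo 'abcdefghijklmnopqrstuvwxyz0123456789' |
> gawk 'BEGIN @{ FIELDWIDTHS = "8 6 1 6 14" @}
>       @{ print $1, $3, $5 @}'
@print{} abcdefgh o vwxyz012345678
@end example

@noindent
Fields @code{$2} and @code{$4} hold the filler (characters 9 through 14
and 16 through 21) and are simply never printed.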
|
|
|
|
Next is the rule that actually processes the data. If the @option{-s} option
|
|
is given, then @code{suppress} is true. The first @code{if} statement
|
|
makes sure that the input record does have the field separator. If
|
|
@command{cut} is processing fields, @code{suppress} is true, and the field
|
|
separator character is not in the record, then the record is skipped.
|
|
|
|
If the record is valid, then @command{gawk} has split the data
|
|
into fields, either using the character in @code{FS} or using fixed-length
|
|
fields and @code{FIELDWIDTHS}. The loop goes through the list of fields
|
|
that should be printed. The corresponding field is printed if it contains data.
|
|
If the next field also has data, then the separator character is
|
|
written out between the fields:
|
|
|
|
@example
|
|
@c file eg/prog/cut.awk
|
|
@{
|
|
if (by_fields && suppress && index($0, FS) == 0)
|
|
next
|
|
|
|
for (i = 1; i <= nfields; i++) @{
|
|
if ($flist[i] != "") @{
|
|
printf "%s", $flist[i]
|
|
if (i < nfields && $flist[i+1] != "")
|
|
printf "%s", OFS
|
|
@}
|
|
@}
|
|
print ""
|
|
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
This version of @command{cut} relies on @command{gawk}'s @code{FIELDWIDTHS}
|
|
variable to do the character-based cutting. While it is possible in
|
|
other @command{awk} implementations to use @code{substr}
|
|
(@pxref{String Functions}),
|
|
it is also extremely painful.
|
|
The @code{FIELDWIDTHS} variable supplies an elegant solution to the problem
|
|
of picking the input line apart by characters.
|
|
@c ENDOFRANGE cut
|
|
@c ENDOFRANGE ficut
|
|
@c ENDOFRANGE colcut
|
|
|
|
@c Exercise: Rewrite using split with "".
|
|
|
|
@node Egrep Program
|
|
@subsection Searching for Regular Expressions in Files
|
|
|
|
@c STARTOFRANGE regexps
|
|
@cindex regular expressions, searching for
|
|
@c STARTOFRANGE sfregexp
|
|
@cindex searching, files for regular expressions
|
|
@c STARTOFRANGE fsregexp
|
|
@cindex files, searching for regular expressions
|
|
@cindex @command{egrep} utility
|
|
The @command{egrep} utility searches files for patterns. It uses regular
|
|
expressions that are almost identical to those available in @command{awk}
|
|
(@pxref{Regexp}).
|
|
It is used in the following manner:
|
|
|
|
@example
|
|
egrep @r{[} @var{options} @r{]} '@var{pattern}' @var{files} @dots{}
|
|
@end example
|
|
|
|
The @var{pattern} is a regular expression. In typical usage, the regular
|
|
expression is quoted to prevent the shell from expanding any of the
|
|
special characters as @value{FN} wildcards. Normally, @command{egrep}
|
|
prints the lines that matched. If multiple @value{FN}s are provided on
|
|
the command line, each output line is preceded by the name of the file
|
|
and a colon.
|
|
|
|
The options to @command{egrep} are as follows:
|
|
|
|
@table @code
|
|
@item -c
|
|
Print out a count of the lines that matched the pattern, instead of the
|
|
lines themselves.
|
|
|
|
@item -s
|
|
Be silent. No output is produced and the exit value indicates whether
|
|
the pattern was matched.
|
|
|
|
@item -v
|
|
Invert the sense of the test. @command{egrep} prints the lines that do
|
|
@emph{not} match the pattern and exits successfully if the pattern is not
|
|
matched.
|
|
|
|
@item -i
|
|
Ignore case distinctions in both the pattern and the input data.
|
|
|
|
@item -l
|
|
Only print (list) the names of the files that matched, not the lines that matched.
|
|
|
|
@item -e @var{pattern}
|
|
Use @var{pattern} as the regexp to match. The purpose of the @option{-e}
|
|
option is to allow patterns that start with a @samp{-}.
|
|
@end table
|
|
|
|
This version uses the @code{getopt} library function
|
|
(@pxref{Getopt Function})
|
|
and the file transition library program
|
|
(@pxref{Filetrans Function}).
|
|
|
|
The program begins with a descriptive comment and then a @code{BEGIN} rule
|
|
that processes the command-line arguments with @code{getopt}. The @option{-i}
|
|
(ignore case) option is particularly easy with @command{gawk}; we just use the
|
|
@code{IGNORECASE} built-in variable
|
|
(@pxref{Built-in Variables}):
|
|
|
|
@cindex @code{egrep.awk} program
|
|
@example
|
|
@c file eg/prog/egrep.awk
|
|
# egrep.awk --- simulate egrep in awk
|
|
@c endfile
|
|
@ignore
|
|
@c file eg/prog/egrep.awk
|
|
#
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# May 1993
|
|
|
|
@c endfile
|
|
@end ignore
|
|
@c file eg/prog/egrep.awk
|
|
# Options:
|
|
# -c count of lines
|
|
# -s silent - use exit value
|
|
# -v invert test, success if no match
|
|
# -i ignore case
|
|
# -l print filenames only
|
|
# -e argument is pattern
|
|
#
|
|
# Requires getopt and file transition library functions
|
|
|
|
BEGIN @{
|
|
while ((c = getopt(ARGC, ARGV, "ce:svil")) != -1) @{
|
|
if (c == "c")
|
|
count_only++
|
|
else if (c == "s")
|
|
no_print++
|
|
else if (c == "v")
|
|
invert++
|
|
else if (c == "i")
|
|
IGNORECASE = 1
|
|
else if (c == "l")
|
|
filenames_only++
|
|
else if (c == "e")
|
|
pattern = Optarg
|
|
else
|
|
usage()
|
|
@}
|
|
@c endfile
|
|
@end example
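
The effect of @code{IGNORECASE} can be tried on its own. This is a small
sketch, separate from @file{egrep.awk}; the variable name @code{pat} and
the sample text are assumptions made for the example:

@example
$ echo 'Hello World' |
> gawk -v pat=hello 'BEGIN @{ IGNORECASE = 1 @} $0 ~ pat'
@print{} Hello World
@end example

@noindent
Without the assignment to @code{IGNORECASE}, the lowercase pattern does
not match and nothing is printed.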
|
|
|
|
Next comes the code that handles the @command{egrep}-specific behavior. If no
|
|
pattern is supplied with @option{-e}, the first nonoption on the
|
|
command line is used. The @command{awk} command-line arguments up to @code{ARGV[Optind]}
|
|
are cleared, so that @command{awk} won't try to process them as files. If no
|
|
files are specified, the standard input is used, and if multiple files are
|
|
specified, we make sure to note this so that the @value{FN}s can precede the
|
|
matched lines in the output:
|
|
|
|
@example
|
|
@c file eg/prog/egrep.awk
|
|
if (pattern == "")
|
|
pattern = ARGV[Optind++]
|
|
|
|
for (i = 1; i < Optind; i++)
|
|
ARGV[i] = ""
|
|
if (Optind >= ARGC) @{
|
|
ARGV[1] = "-"
|
|
ARGC = 2
|
|
@} else if (ARGC - Optind > 1)
|
|
do_filenames++
|
|
|
|
# if (IGNORECASE)
|
|
# pattern = tolower(pattern)
|
|
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
The last two lines are commented out, since they are not needed in
|
|
@command{gawk}. They should be uncommented if you have to use another version
|
|
of @command{awk}.
|
|
|
|
The next set of lines should be uncommented if you are not using
|
|
@command{gawk}. This rule translates all the characters in the input line
|
|
into lowercase if the @option{-i} option is specified.@footnote{It
|
|
also introduces a subtle bug;
|
|
if a match happens, we output the translated line, not the original.}
|
|
The rule is
|
|
commented out since it is not necessary with @command{gawk}:
|
|
|
|
@c Exercise: Fix this, w/array and new line as key to original line
|
|
|
|
@example
|
|
@c file eg/prog/egrep.awk
|
|
#@{
|
|
# if (IGNORECASE)
|
|
# $0 = tolower($0)
|
|
#@}
|
|
@c endfile
|
|
@end example
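
One possible way around the problem mentioned in the footnote is to leave
@code{$0} untouched and to lowercase only a copy of it inside the
comparison. The following is a sketch under that assumption, not the
approach taken by @file{egrep.awk}. It also assumes that the pattern
contains no escape sequences whose meaning changes when lowercased:

@example
$ echo 'Hello World' |
> awk '@{ if (tolower($0) ~ tolower("HELLO")) print @}'
@print{} Hello World
@end example

@noindent
Since @code{print} uses the unmodified record, the original line is
what appears in the output.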
|
|
|
|
The @code{beginfile} function is called by the rule in @file{ftrans.awk}
|
|
when each new file is processed. In this case, it is very simple; all it
|
|
does is initialize a variable @code{fcount} to zero. @code{fcount} tracks
|
|
how many lines in the current file matched the pattern
|
|
(naming the parameter @code{junk} shows we know that @code{beginfile}
|
|
is called with a parameter, but that we're not interested in its value):
|
|
|
|
@example
|
|
@c file eg/prog/egrep.awk
|
|
function beginfile(junk)
|
|
@{
|
|
fcount = 0
|
|
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
The @code{endfile} function is called after each file has been processed.
|
|
It affects the output only when the user wants a count of the number of lines that
|
|
matched. @code{no_print} is true only if the exit status is desired.
|
|
@code{count_only} is true if line counts are desired. @command{egrep}
|
|
therefore only prints line counts if printing and counting are enabled.
|
|
The output format must be adjusted depending upon the number of files to
|
|
process. Finally, @code{fcount} is added to @code{total}, so that we
|
|
know the total number of lines that matched the pattern:
|
|
|
|
@example
|
|
@c file eg/prog/egrep.awk
|
|
function endfile(file)
|
|
@{
|
|
if (! no_print && count_only)
|
|
if (do_filenames)
|
|
print file ":" fcount
|
|
else
|
|
print fcount
|
|
|
|
total += fcount
|
|
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
The following rule does most of the work of matching lines. The variable
|
|
@code{matches} is true if the line matched the pattern. If the user
|
|
wants lines that did not match, the sense of @code{matches} is inverted
|
|
using the @samp{!} operator. @code{fcount} is incremented with the value of
|
|
@code{matches}, which is either one or zero, depending upon a
|
|
successful or unsuccessful match. If the line does not match, the
|
|
@code{next} statement just moves on to the next record.
|
|
|
|
@cindex @code{!} (exclamation point), @code{!} operator
|
|
@cindex exclamation point (@code{!}), @code{!} operator
|
|
A number of additional tests are made, but they are only done if we
|
|
are not counting lines. First, if the user only wants exit status
|
|
(@code{no_print} is true), then it is enough to know that @emph{one}
|
|
line in this file matched, and we can skip on to the next file with
|
|
@code{nextfile}. Similarly, if we are only printing @value{FN}s, we can
|
|
print the @value{FN}, and then skip to the next file with @code{nextfile}.
|
|
Finally, each line is printed, with a leading @value{FN} and colon
|
|
if necessary:
|
|
|
|
@cindex @code{!} operator
|
|
@example
|
|
@c file eg/prog/egrep.awk
|
|
@{
|
|
matches = ($0 ~ pattern)
|
|
if (invert)
|
|
matches = ! matches
|
|
|
|
fcount += matches # 1 or 0
|
|
|
|
if (! matches)
|
|
next
|
|
|
|
if (! count_only) @{
|
|
if (no_print)
|
|
nextfile
|
|
|
|
if (filenames_only) @{
|
|
print FILENAME
|
|
nextfile
|
|
@}
|
|
|
|
if (do_filenames)
|
|
print FILENAME ":" $0
|
|
else
|
|
print
|
|
@}
|
|
@}
|
|
@c endfile
|
|
@end example
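
The counting idiom relies on the fact that a comparison yields one or
zero. A minimal sketch, using made-up input and the variable name
@code{n}, shows the idea in isolation:

@example
$ printf 'apple\nbanana\ncherry\n' |
> awk '@{ n += ($0 ~ /an/) @} END @{ print n @}'
@print{} 1
@end example

@noindent
Only @samp{banana} contains @samp{an}, so the sum of the match results
is one.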
|
|
|
|
The @code{END} rule takes care of producing the correct exit status. If
|
|
there are no matches, the exit status is one; otherwise it is zero:
|
|
|
|
@example
|
|
@c file eg/prog/egrep.awk
|
|
END \
|
|
@{
|
|
if (total == 0)
|
|
exit 1
|
|
exit 0
|
|
@}
|
|
@c endfile
|
|
@end example
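
The same exit-status logic can be checked from the shell. In this sketch
the pattern @samp{/bar/} and the input are assumptions for the example;
the shell variable @code{$?} holds the exit status of the last pipeline:

@example
$ echo foo |
> awk '/bar/ @{ total++ @} END @{ if (total == 0) exit 1; exit 0 @}'
$ echo $?
@print{} 1
@end example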
|
|
|
|
The @code{usage} function prints a usage message in case of invalid options,
|
|
and then exits:
|
|
|
|
@example
|
|
@c file eg/prog/egrep.awk
|
|
function usage( e)
|
|
@{
|
|
e = "Usage: egrep [-csvil] [-e pat] [files ...]"
|
|
e = e "\n\tegrep [-csvil] pat [files ...]"
|
|
print e > "/dev/stderr"
|
|
exit 1
|
|
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
The variable @code{e} is used so that the function fits nicely
|
|
on the printed page.
|
|
|
|
@cindex @code{END} pattern, backslash continuation and
|
|
@cindex @code{\} (backslash), continuing lines and
|
|
@cindex backslash (@code{\}), continuing lines and
|
|
Just a note on programming style: you may have noticed that the @code{END}
|
|
rule uses backslash continuation, with the open brace on a line by
|
|
itself. This is so that it more closely resembles the way functions
|
|
are written. Many of the examples
|
|
in this @value{CHAPTER}
|
|
use this style. You can decide for yourself if you like writing
|
|
your @code{BEGIN} and @code{END} rules this way
|
|
or not.
|
|
@c ENDOFRANGE regexps
|
|
@c ENDOFRANGE sfregexp
|
|
@c ENDOFRANGE fsregexp
|
|
|
|
@node Id Program
|
|
@subsection Printing out User Information
|
|
|
|
@cindex printing, user information
|
|
@cindex users, information about, printing
|
|
@cindex @command{id} utility
|
|
The @command{id} utility lists a user's real and effective user ID numbers,
|
|
real and effective group ID numbers, and the user's group set, if any.
|
|
@command{id} only prints the effective user ID and group ID if they are
|
|
different from the real ones. If possible, @command{id} also supplies the
|
|
corresponding user and group names. The output might look like this:
|
|
|
|
@example
|
|
$ id
|
|
@print{} uid=2076(arnold) gid=10(staff) groups=10(staff),4(tty)
|
|
@end example
|
|
|
|
This information is part of what is provided by @command{gawk}'s
|
|
@code{PROCINFO} array (@pxref{Built-in Variables}).
|
|
However, the @command{id} utility provides a more palatable output than just
|
|
individual numbers.
|
|
|
|
Here is a simple version of @command{id} written in @command{awk}.
|
|
It uses the user database library functions
|
|
(@pxref{Passwd Functions})
|
|
and the group database library functions
|
|
(@pxref{Group Functions}).
|
|
|
|
The program is fairly straightforward. All the work is done in the
|
|
@code{BEGIN} rule. The user and group ID numbers are obtained from
|
|
@code{PROCINFO}.
|
|
The code is repetitive. The entry in the user database for the real user ID
|
|
number is split into parts at the @samp{:}. The name is the first field.
|
|
Similar code is used for the effective user ID number and the group
|
|
numbers:
|
|
|
|
@cindex @code{id.awk} program
|
|
@example
|
|
@c file eg/prog/id.awk
|
|
# id.awk --- implement id in awk
|
|
#
|
|
# Requires user and group library functions
|
|
@c endfile
|
|
@ignore
|
|
@c file eg/prog/id.awk
|
|
#
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# May 1993
|
|
# Revised February 1996
|
|
|
|
@c endfile
|
|
@end ignore
|
|
@c file eg/prog/id.awk
|
|
# output is:
|
|
# uid=12(foo) euid=34(bar) gid=3(baz) \
|
|
# egid=5(blat) groups=9(nine),2(two),1(one)
|
|
|
|
@group
|
|
BEGIN \
|
|
@{
|
|
uid = PROCINFO["uid"]
|
|
euid = PROCINFO["euid"]
|
|
gid = PROCINFO["gid"]
|
|
egid = PROCINFO["egid"]
|
|
@end group
|
|
|
|
printf("uid=%d", uid)
|
|
pw = getpwuid(uid)
|
|
if (pw != "") @{
|
|
split(pw, a, ":")
|
|
printf("(%s)", a[1])
|
|
@}
|
|
|
|
if (euid != uid) @{
|
|
printf(" euid=%d", euid)
|
|
pw = getpwuid(euid)
|
|
if (pw != "") @{
|
|
split(pw, a, ":")
|
|
printf("(%s)", a[1])
|
|
@}
|
|
@}
|
|
|
|
printf(" gid=%d", gid)
|
|
pw = getgrgid(gid)
|
|
if (pw != "") @{
|
|
split(pw, a, ":")
|
|
printf("(%s)", a[1])
|
|
@}
|
|
|
|
if (egid != gid) @{
|
|
printf(" egid=%d", egid)
|
|
pw = getgrgid(egid)
|
|
if (pw != "") @{
|
|
split(pw, a, ":")
|
|
printf("(%s)", a[1])
|
|
@}
|
|
@}
|
|
|
|
for (i = 1; ("group" i) in PROCINFO; i++) @{
|
|
if (i == 1)
|
|
printf(" groups=")
|
|
group = PROCINFO["group" i]
|
|
printf("%d", group)
|
|
pw = getgrgid(group)
|
|
if (pw != "") @{
|
|
split(pw, a, ":")
|
|
printf("(%s)", a[1])
|
|
@}
|
|
if (("group" (i+1)) in PROCINFO)
|
|
printf(",")
|
|
@}
|
|
|
|
print ""
|
|
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
@cindex @code{in} operator
|
|
The test in the @code{for} loop is worth noting.
|
|
Any supplementary groups in the @code{PROCINFO} array have the
|
|
indices @code{"group1"} through @code{"group@var{N}"} for some
|
|
@var{N}, i.e., the total number of supplementary groups.
|
|
However, we don't know in advance how many of these groups
|
|
there are.
|
|
|
|
This loop works by starting at one, concatenating the value with
|
|
@code{"group"}, and then using @code{in} to see if that value is
|
|
in the array. Eventually, @code{i} is incremented past
|
|
the last group in the array and the loop exits.
|
|
|
|
The loop is also correct if there are @emph{no} supplementary
|
|
groups; then the condition is false the first time it's
|
|
tested, and the loop body never executes.
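
The same loop structure works with any array whose indices follow this
naming scheme. The sketch below uses an ordinary array named @code{info}
(an assumption for the example) in place of @code{PROCINFO}, so that it
runs in any POSIX @command{awk}:

@example
BEGIN @{
    info["group1"] = 10
    info["group2"] = 4
    # stops as soon as "group" i is no longer an index of info
    for (i = 1; ("group" i) in info; i++)
        print i, info["group" i]
@}
@end example

@noindent
This should print @samp{1 10} and then @samp{2 4}. Note that the
@code{in} test does not create the element @code{info["group3"]}, which
a test such as @samp{info["group" i] != ""} would do.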
|
|
|
|
@c exercise!!!
|
|
@ignore
|
|
The POSIX version of @command{id} takes arguments that control which
|
|
information is printed. Modify this version to accept the same
|
|
arguments and perform in the same way.
|
|
@end ignore
|
|
|
|
@node Split Program
|
|
@subsection Splitting a Large File into Pieces
|
|
|
|
@c STARTOFRANGE filspl
|
|
@cindex files, splitting
|
|
@cindex @code{split} utility
|
|
The @code{split} program splits large text files into smaller pieces.
|
|
Usage is as follows:
|
|
|
|
@example
|
|
split @r{[}-@var{count}@r{]} file @r{[} @var{prefix} @r{]}
|
|
@end example
|
|
|
|
By default,
|
|
the output files are named @file{xaa}, @file{xab}, and so on. Each file has
|
|
1000 lines in it, with the likely exception of the last file. To change the
|
|
number of lines in each file, supply a number on the command line
|
|
preceded with a minus; e.g., @samp{-500} for files with 500 lines in them
|
|
instead of 1000. To change the name of the output files to something like
|
|
@file{myfileaa}, @file{myfileab}, and so on, supply an additional
|
|
argument that specifies the @value{FN} prefix.
|
|
|
|
Here is a version of @code{split} in @command{awk}. It uses the @code{ord} and
|
|
@code{chr} functions presented in
|
|
@ref{Ordinal Functions}.
|
|
|
|
The program first sets its defaults, and then tests to make sure there are
|
|
not too many arguments. It then looks at each argument in turn. The
|
|
first argument could be a minus sign followed by a number. If it is, this happens
|
|
to look like a negative number, so it is made positive, and that is the
|
|
count of lines. The data @value{FN} is skipped over and the final argument
|
|
is used as the prefix for the output @value{FN}s:
|
|
|
|
@cindex @code{split.awk} program
|
|
@example
|
|
@c file eg/prog/split.awk
|
|
# split.awk --- do split in awk
|
|
#
|
|
# Requires ord and chr library functions
|
|
@c endfile
|
|
@ignore
|
|
@c file eg/prog/split.awk
|
|
#
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# May 1993
|
|
|
|
@c endfile
|
|
@end ignore
|
|
@c file eg/prog/split.awk
|
|
# usage: split [-num] [file] [outname]
|
|
|
|
BEGIN @{
|
|
outfile = "x" # default
|
|
count = 1000
|
|
if (ARGC > 4)
|
|
usage()
|
|
|
|
i = 1
|
|
if (ARGV[i] ~ /^-[0-9]+$/) @{
|
|
count = -ARGV[i]
|
|
ARGV[i] = ""
|
|
i++
|
|
@}
|
|
# test argv in case reading from stdin instead of file
|
|
if (i in ARGV)
|
|
i++ # skip data file name
|
|
if (i in ARGV) @{
|
|
outfile = ARGV[i]
|
|
ARGV[i] = ""
|
|
@}
|
|
|
|
s1 = s2 = "a"
|
|
out = (outfile s1 s2)
|
|
@}
|
|
@c endfile
|
|
@end example
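
The negation in @samp{count = -ARGV[i]} works because the unary minus
converts the string to a number first. A one-line sketch, with the
variable name @code{arg} standing in for @code{ARGV[i]}:

@example
$ awk 'BEGIN @{ arg = "-500"; print -arg @}'
@print{} 500
@end example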
|
|
|
|
The next rule does most of the work. @code{tcount} (temporary count) tracks
|
|
how many lines have been printed to the output file so far. If it is greater
|
|
than @code{count}, it is time to close the current file and start a new one.
|
|
@code{s1} and @code{s2} track the current suffixes for the @value{FN}. If
|
|
they are both @samp{z}, the file is just too big. Otherwise, @code{s1}
|
|
moves to the next letter in the alphabet and @code{s2} starts over again at
|
|
@samp{a}:
|
|
|
|
@c else on separate line here for page breaking
|
|
@example
|
|
@c file eg/prog/split.awk
|
|
@{
|
|
if (++tcount > count) @{
|
|
close(out)
|
|
if (s2 == "z") @{
|
|
if (s1 == "z") @{
|
|
printf("split: %s is too large to split\n",
|
|
FILENAME) > "/dev/stderr"
|
|
exit 1
|
|
@}
|
|
s1 = chr(ord(s1) + 1)
|
|
s2 = "a"
|
|
@}
|
|
@group
|
|
else
|
|
s2 = chr(ord(s2) + 1)
|
|
@end group
|
|
out = (outfile s1 s2)
|
|
tcount = 1
|
|
@}
|
|
print > out
|
|
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
@c Exercise: do this with just awk builtin functions, index("abc..."), substr, etc.
|
|
|
|
@noindent
|
|
The @code{usage} function simply prints an error message and exits:
|
|
|
|
@example
|
|
@c file eg/prog/split.awk
|
|
function usage( e)
|
|
@{
|
|
e = "usage: split [-num] [file] [outname]"
|
|
print e > "/dev/stderr"
|
|
exit 1
|
|
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
@noindent
|
|
The variable @code{e} is used so that the function
|
|
fits nicely on the
|
|
@ifinfo
|
|
screen.
|
|
@end ifinfo
|
|
@ifnotinfo
|
|
page.
|
|
@end ifnotinfo
|
|
|
|
This program is a bit sloppy; it relies on @command{awk} to automatically close the last file
|
|
instead of doing it in an @code{END} rule.
|
|
It also assumes that letters are contiguous in the character set,
|
|
which isn't true for EBCDIC systems.
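
One way to avoid the dependence on the character set is to look up the
next letter in a string instead of doing arithmetic on character codes.
The function below is only a sketch; the name @code{next_letter} is
hypothetical and does not appear in @file{split.awk}:

@example
# next_letter --- return the letter after c, or "" if c is "z"
function next_letter(c,    letters)
@{
    letters = "abcdefghijklmnopqrstuvwxyz"
    return substr(letters, index(letters, c) + 1, 1)
@}

BEGIN @{ print next_letter("a"), next_letter("y") @}
@end example

@noindent
This should print @samp{b z}. It uses only the built-in functions
@code{index} and @code{substr}, so the @code{ord} and @code{chr} library
functions would not be needed.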
|
|
@c BFD...
|
|
@c ENDOFRANGE filspl
|
|
|
|
@node Tee Program
|
|
@subsection Duplicating Output into Multiple Files
|
|
|
|
@c last comma is part of secondary
|
|
@cindex files, multiple, duplicating output into
|
|
@cindex output, duplicating into files
|
|
@cindex @code{tee} utility
|
|
The @code{tee} program is known as a ``pipe fitting.'' @code{tee} copies
|
|
its standard input to its standard output and also duplicates it to the
|
|
files named on the command line. Its usage is as follows:
|
|
|
|
@example
|
|
tee @r{[}-a@r{]} file @dots{}
|
|
@end example
|
|
|
|
The @option{-a} option tells @code{tee} to append to the named files, instead of
|
|
truncating them and starting over.
|
|
|
|
The @code{BEGIN} rule first makes a copy of all the command-line arguments
|
|
into an array named @code{copy}.
|
|
@code{ARGV[0]} is not copied, since it is not needed.
|
|
@code{tee} cannot use @code{ARGV} directly, since @command{awk} attempts to
|
|
process each @value{FN} in @code{ARGV} as input data.
|
|
|
|
@cindex flag variables
|
|
If the first argument is @option{-a}, then the flag variable
|
|
@code{append} is set to true, and both @code{ARGV[1]} and
|
|
@code{copy[1]} are deleted. If @code{ARGC} is less than two, then no
|
|
@value{FN}s were supplied and @code{tee} prints a usage message and exits.
|
|
Finally, @command{awk} is forced to read the standard input by setting
|
|
@code{ARGV[1]} to @code{"-"} and @code{ARGC} to two:
|
|
|
|
@c NEXT ED: Add more leading commentary in this program
|
|
@cindex @code{tee.awk} program
|
|
@example
|
|
@c file eg/prog/tee.awk
|
|
# tee.awk --- tee in awk
|
|
@c endfile
|
|
@ignore
|
|
@c file eg/prog/tee.awk
|
|
#
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# May 1993
|
|
# Revised December 1995
|
|
|
|
@c endfile
|
|
@end ignore
|
|
@c file eg/prog/tee.awk
|
|
BEGIN \
|
|
@{
|
|
for (i = 1; i < ARGC; i++)
|
|
copy[i] = ARGV[i]
|
|
|
|
if (ARGV[1] == "-a") @{
|
|
append = 1
|
|
delete ARGV[1]
|
|
delete copy[1]
|
|
ARGC--
|
|
@}
|
|
if (ARGC < 2) @{
|
|
print "usage: tee [-a] file ..." > "/dev/stderr"
|
|
exit 1
|
|
@}
|
|
ARGV[1] = "-"
|
|
ARGC = 2
|
|
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
The single rule does all the work. Since there is no pattern, it is
|
|
executed for each line of input. The body of the rule simply prints the
|
|
line into each file on the command line, and then to the standard output:
|
|
|
|
@example
|
|
@c file eg/prog/tee.awk
|
|
@{
|
|
# moving the if outside the loop makes it run faster
|
|
if (append)
|
|
for (i in copy)
|
|
print >> copy[i]
|
|
else
|
|
for (i in copy)
|
|
print > copy[i]
|
|
print
|
|
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
@noindent
|
|
It is also possible to write the loop this way:
|
|
|
|
@example
|
|
for (i in copy)
|
|
if (append)
|
|
print >> copy[i]
|
|
else
|
|
print > copy[i]
|
|
@end example
|
|
|
|
@noindent
|
|
This is more concise but it is also less efficient. The @samp{if} is
|
|
tested for each record and for each output file. By duplicating the loop
|
|
body, the @samp{if} is only tested once for each input record. If there are
|
|
@var{N} input records and @var{M} output files, the first method only
|
|
executes @var{N} @samp{if} statements, while the second executes
|
|
@var{N}@code{*}@var{M} @samp{if} statements.
|
|
|
|
Finally, the @code{END} rule cleans up by closing all the output files:
|
|
|
|
@example
|
|
@c file eg/prog/tee.awk
|
|
END \
|
|
@{
|
|
for (i in copy)
|
|
close(copy[i])
|
|
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
@node Uniq Program
|
|
@subsection Printing Nonduplicated Lines of Text
|
|
|
|
@c STARTOFRANGE prunt
|
|
@cindex printing, unduplicated lines of text
|
|
@c first comma is part of primary
|
|
@c STARTOFRANGE tpul
|
|
@cindex text, printing, unduplicated lines of
|
|
@cindex @command{uniq} utility
|
|
The @command{uniq} utility reads sorted lines of data on its standard
|
|
input, and by default removes duplicate lines. In other words, it only
|
|
prints unique lines---hence the name. @command{uniq} has a number of
|
|
options. The usage is as follows:
|
|
|
|
@example
|
|
uniq @r{[}-udc @r{[}-@var{n}@r{]]} @r{[}+@var{n}@r{]} @r{[} @var{input file} @r{[} @var{output file} @r{]]}
|
|
@end example
|
|
|
|
The options for @command{uniq} are:
|
|
|
|
@table @code
|
|
@item -d
|
|
Print only repeated lines.
|
|
|
|
@item -u
|
|
Print only nonrepeated lines.
|
|
|
|
@item -c
|
|
Count lines. This option overrides @option{-d} and @option{-u}. Both repeated
|
|
and nonrepeated lines are counted.
|
|
|
|
@item -@var{n}
|
|
Skip @var{n} fields before comparing lines. The definition of fields
|
|
is similar to @command{awk}'s default: nonwhitespace characters separated
|
|
by runs of spaces and/or tabs.
|
|
|
|
@item +@var{n}
|
|
Skip @var{n} characters before comparing lines. Any fields specified with
|
|
@samp{-@var{n}} are skipped first.
|
|
|
|
@item @var{input file}
|
|
Data is read from the input file named on the command line, instead of from
|
|
the standard input.
|
|
|
|
@item @var{output file}
|
|
The generated output is sent to the named output file, instead of to the
|
|
standard output.
|
|
@end table
|
|
|
|
Normally @command{uniq} behaves as if both the @option{-d} and
|
|
@option{-u} options are provided.
|
|
|
|
@command{uniq} uses the
|
|
@code{getopt} library function
|
|
(@pxref{Getopt Function})
|
|
and the @code{join} library function
|
|
(@pxref{Join Function}).
|
|
|
|
The program begins with a @code{usage} function and then a brief outline of
|
|
the options and their meanings in a comment.
|
|
The @code{BEGIN} rule deals with the command-line arguments and options. It
|
|
uses a trick to get @code{getopt} to handle options of the form @samp{-25},
|
|
treating such an option as the option letter @samp{2} with an argument of
|
|
@samp{5}. If indeed two or more digits are supplied (@code{Optarg} looks
|
|
like a number), @code{Optarg} is
|
|
concatenated with the option digit and then the result is added to zero to make
|
|
it into a number. If there is only one digit in the option, then
|
|
@code{Optarg} is not needed. In this case, @code{Optind} must be decremented so that
|
|
@code{getopt} processes it next time. This code is admittedly a bit
|
|
tricky.
|
|
|
|
If no options are supplied, then the default is taken, to print both
|
|
repeated and nonrepeated lines. The output file, if provided, is assigned
|
|
to @code{outputfile}. Early on, @code{outputfile} is initialized to the
|
|
standard output, @file{/dev/stdout}:
|
|
|
|
@cindex @code{uniq.awk} program
|
|
@example
|
|
@c file eg/prog/uniq.awk
|
|
@group
|
|
# uniq.awk --- do uniq in awk
|
|
#
|
|
# Requires getopt and join library functions
|
|
@end group
|
|
@c endfile
|
|
@ignore
|
|
@c file eg/prog/uniq.awk
|
|
#
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# May 1993
|
|
|
|
@c endfile
|
|
@end ignore
|
|
@c file eg/prog/uniq.awk
|
|
function usage( e)
|
|
@{
|
|
e = "Usage: uniq [-udc [-n]] [+n] [ in [ out ]]"
|
|
print e > "/dev/stderr"
|
|
exit 1
|
|
@}
|
|
|
|
# -c count lines. overrides -d and -u
|
|
# -d only repeated lines
|
|
# -u only non-repeated lines
|
|
# -n skip n fields
|
|
# +n skip n characters, skip fields first
|
|
|
|
BEGIN \
|
|
@{
|
|
count = 1
|
|
outputfile = "/dev/stdout"
|
|
opts = "udc0:1:2:3:4:5:6:7:8:9:"
|
|
while ((c = getopt(ARGC, ARGV, opts)) != -1) @{
|
|
if (c == "u")
|
|
non_repeated_only++
|
|
else if (c == "d")
|
|
repeated_only++
|
|
else if (c == "c")
|
|
do_count++
|
|
else if (index("0123456789", c) != 0) @{
|
|
# getopt requires args to options
|
|
# this messes us up for things like -5
|
|
if (Optarg ~ /^[0-9]+$/)
|
|
fcount = (c Optarg) + 0
|
|
else @{
|
|
fcount = c + 0
|
|
Optind--
|
|
@}
|
|
@} else
|
|
usage()
|
|
@}
|
|
|
|
if (ARGV[Optind] ~ /^\+[0-9]+$/) @{
|
|
charcount = substr(ARGV[Optind], 2) + 0
|
|
Optind++
|
|
@}
|
|
|
|
for (i = 1; i < Optind; i++)
|
|
ARGV[i] = ""
|
|
|
|
if (repeated_only == 0 && non_repeated_only == 0)
|
|
repeated_only = non_repeated_only = 1
|
|
|
|
if (ARGC - Optind == 2) @{
|
|
outputfile = ARGV[ARGC - 1]
|
|
ARGV[ARGC - 1] = ""
|
|
@}
|
|
@}
|
|
@c endfile
|
|
@end example
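
The digit-handling trick is easier to follow with the values written out.
In this sketch, @code{c} and @code{Optarg} are set by hand to what
@code{getopt} would presumably return for the option @samp{-25}:

@example
$ awk 'BEGIN @{ c = "2"; Optarg = "5"
>              n = (c Optarg) + 0; print n @}'
@print{} 25
@end example

@noindent
The concatenation produces the string @code{"25"}, and adding zero
converts it to a number.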
|
|
|
|
The following function, @code{are_equal}, compares the current line,
|
|
@code{$0}, to the
|
|
previous line, @code{last}. It handles skipping fields and characters.
|
|
If no field count and no character count are specified, @code{are_equal}
|
|
simply returns one or zero depending upon the result of a simple string
|
|
comparison of @code{last} and @code{$0}. Otherwise, things get more
|
|
complicated.
|
|
If fields have to be skipped, each line is broken into an array using
|
|
@code{split}
|
|
(@pxref{String Functions});
|
|
the desired fields are then joined back into a line using @code{join}.
|
|
The joined lines are stored in @code{clast} and @code{cline}.
|
|
If no fields are skipped, @code{clast} and @code{cline} are set to
|
|
@code{last} and @code{$0}, respectively.
|
|
Finally, if characters are skipped, @code{substr} is used to strip off the
|
|
leading @code{charcount} characters in @code{clast} and @code{cline}. The
|
|
two strings are then compared and @code{are_equal} returns the result:
|
|
|
|
@example
|
|
@c file eg/prog/uniq.awk
|
|
function are_equal( n, m, clast, cline, alast, aline)
|
|
@{
|
|
if (fcount == 0 && charcount == 0)
|
|
return (last == $0)
|
|
|
|
if (fcount > 0) @{
|
|
n = split(last, alast)
|
|
m = split($0, aline)
|
|
clast = join(alast, fcount+1, n)
|
|
cline = join(aline, fcount+1, m)
|
|
@} else @{
|
|
clast = last
|
|
cline = $0
|
|
@}
|
|
if (charcount) @{
|
|
clast = substr(clast, charcount + 1)
|
|
cline = substr(cline, charcount + 1)
|
|
@}
|
|
|
|
return (clast == cline)
|
|
@}
|
|
@c endfile
|
|
@end example
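
The field-skipping comparison can be sketched without the @code{join}
library function. The helper name @code{rest} and the two sample lines
are assumptions made for this example only:

@example
# rest --- return line with its first nf fields removed
function rest(line, nf,    n, i, a, s)
@{
    n = split(line, a)
    for (i = nf + 1; i <= n; i++)
        s = s (i > nf + 1 ? " " : "") a[i]
    return s
@}

BEGIN @{ print (rest("1 foo bar", 1) == rest("2 foo   bar", 1)) @}
@end example

@noindent
This should print @samp{1}: once the first field is skipped, both lines
reduce to @samp{foo bar}. As with @code{are_equal}, runs of whitespace
between the remaining fields are not significant, because @code{split}
discards them.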
|
|
|
|
The following two rules are the body of the program. The first one is
|
|
executed only for the very first line of data. It sets @code{last} equal to
|
|
@code{$0}, so that subsequent lines of text have something to be compared to.
|
|
|
|
The second rule does the work. The variable @code{equal} is one or zero,
|
|
depending upon the results of @code{are_equal}'s comparison. If @command{uniq}
|
|
is counting repeated lines, and the lines are equal, then it increments the @code{count} variable.
|
|
Otherwise, it prints the line and resets @code{count},
|
|
since the two lines are not equal.
|
|
|
|
If @command{uniq} is not counting, and if the lines are equal, @code{count} is incremented.
|
|
Nothing is printed, since the point is to remove duplicates.
|
|
Otherwise, if @command{uniq} is counting repeated lines and more than
|
|
one line is seen, or if @command{uniq} is counting nonrepeated lines
|
|
and only one line is seen, then the line is printed, and @code{count}
|
|
is reset.
|
|
|
|
Finally, similar logic is used in the @code{END} rule to print the final
|
|
line of input data:
|
|
|
|
@example
|
|
@c file eg/prog/uniq.awk
|
|
NR == 1 @{
    last = $0
    next
@}

@{
    equal = are_equal()

    if (do_count) @{    # overrides -d and -u
        if (equal)
            count++
        else @{
            printf("%4d %s\n", count, last) > outputfile
            last = $0
            count = 1    # reset
        @}
        next
    @}

    if (equal)
        count++
    else @{
        if ((repeated_only && count > 1) ||
            (non_repeated_only && count == 1))
                print last > outputfile
        last = $0
        count = 1
    @}
@}

END @{
    if (do_count)
        printf("%4d %s\n", count, last) > outputfile
    else if ((repeated_only && count > 1) ||
            (non_repeated_only && count == 1))
        print last > outputfile
@}
|
|
@c endfile
|
|
@end example
|
|
@c ENDOFRANGE prunt
|
|
@c ENDOFRANGE tpul
|
|
|
|
@node Wc Program
|
|
@subsection Counting Things
|
|
|
|
@c STARTOFRANGE count
|
|
@cindex counting
|
|
@c STARTOFRANGE infco
|
|
@cindex input files, counting elements in
|
|
@c STARTOFRANGE woco
|
|
@cindex words, counting
|
|
@c STARTOFRANGE chco
|
|
@cindex characters, counting
|
|
@c STARTOFRANGE lico
|
|
@cindex lines, counting
|
|
@cindex @command{wc} utility
|
|
The @command{wc} (word count) utility counts lines, words, and characters in
|
|
one or more input files. Its usage is as follows:
|
|
|
|
@example
|
|
wc @r{[}-lwc@r{]} @r{[} @var{files} @dots{} @r{]}
|
|
@end example
|
|
|
|
If no files are specified on the command line, @command{wc} reads its standard
|
|
input. If there are multiple files, it also prints total counts for all
|
|
the files. The options and their meanings are shown in the following list:
|
|
|
|
@table @code
|
|
@item -l
|
|
Count only lines.
|
|
|
|
@item -w
|
|
Count only words.
|
|
A ``word'' is a contiguous sequence of nonwhitespace characters, separated
|
|
by spaces and/or tabs. Luckily, this is the normal way @command{awk} separates
|
|
fields in its input data.
|
|
|
|
@item -c
|
|
Count only characters.
|
|
@end table
|
|
|
|
Implementing @command{wc} in @command{awk} is particularly elegant,
|
|
since @command{awk} does a lot of the work for us; it splits lines into
|
|
words (i.e., fields) and counts them, it counts lines (i.e., records),
|
|
and it can easily tell us how long a line is.
|
|
|
|
This program uses the @code{getopt} library function
|
|
(@pxref{Getopt Function})
|
|
and the file-transition functions
|
|
(@pxref{Filetrans Function}).
|
|
|
|
This version has one notable difference from traditional versions of
|
|
@command{wc}: it always prints the counts in the order lines, words,
|
|
and characters. Traditional versions note the order of the @option{-l},
|
|
@option{-w}, and @option{-c} options on the command line, and print the
|
|
counts in that order.
|
|
|
|
The @code{BEGIN} rule does the argument processing. The variable
|
|
@code{print_total} is true if more than one file is named on the
|
|
command line:
|
|
|
|
@cindex @code{wc.awk} program
|
|
@example
|
|
@c file eg/prog/wc.awk
|
|
# wc.awk --- count lines, words, characters
|
|
@c endfile
|
|
@ignore
|
|
@c file eg/prog/wc.awk
|
|
#
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# May 1993
|
|
@c endfile
|
|
@end ignore
|
|
@c file eg/prog/wc.awk
|
|
|
|
# Options:
|
|
# -l only count lines
|
|
# -w only count words
|
|
# -c only count characters
|
|
#
|
|
# Default is to count lines, words, characters
|
|
#
|
|
# Requires getopt and file transition library functions
|
|
|
|
BEGIN @{
    # let getopt print a message about
    # invalid options. we ignore them
    while ((c = getopt(ARGC, ARGV, "lwc")) != -1) @{
        if (c == "l")
            do_lines = 1
        else if (c == "w")
            do_words = 1
        else if (c == "c")
            do_chars = 1
    @}
    for (i = 1; i < Optind; i++)
        ARGV[i] = ""

    # if no options, do all
    if (! do_lines && ! do_words && ! do_chars)
        do_lines = do_words = do_chars = 1

    # i == Optind here, so ARGC - i is the number of file operands
    print_total = (ARGC - i > 1)
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
The @code{beginfile} function is simple; it just resets the counts of lines,
|
|
words, and characters to zero, and saves the current @value{FN} in
|
|
@code{fname}:
|
|
|
|
@c NEXT ED: make it lines = words = chars = 0
|
|
@example
|
|
@c file eg/prog/wc.awk
|
|
function beginfile(file)
@{
    chars = lines = words = 0
    fname = FILENAME
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
The @code{endfile} function adds the current file's numbers to the running
|
|
totals of lines, words, and characters.@footnote{@command{wc} can't just use the value of
|
|
@code{FNR} in @code{endfile}. If you examine
|
|
the code in
|
|
@ref{Filetrans Function}
|
|
you will see that
|
|
@code{FNR} has already been reset by the time
|
|
@code{endfile} is called.} It then prints out those numbers
|
|
for the file that was just read. It relies on @code{beginfile} to reset the
|
|
numbers for the following @value{DF}:
|
|
@c ONE DAY: make the above footnote an exercise, instead of giving away the answer.
|
|
|
|
@c NEXT ED: make order for += be lines, words, chars
|
|
@example
|
|
@c file eg/prog/wc.awk
|
|
function endfile(file)
@{
    tchars += chars
    tlines += lines
    twords += words
    if (do_lines)
        printf "\t%d", lines
@group
    if (do_words)
        printf "\t%d", words
@end group
    if (do_chars)
        printf "\t%d", chars
    printf "\t%s\n", fname
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
There is one rule that is executed for each line. It adds the length of
|
|
the record, plus one, to @code{chars}. Adding one to the record length
|
|
is needed because the newline character separating records (the value
|
|
of @code{RS}) is not part of the record itself, and thus not included
|
|
in its length. Next, @code{lines} is incremented for each line read,
|
|
and @code{words} is incremented by the value of @code{NF}, which is the
|
|
number of ``words'' on this line:
|
|
|
|
@example
|
|
@c file eg/prog/wc.awk
|
|
# do per line
@{
    chars += length($0) + 1    # get newline
    lines++
    words += NF
@}
|
|
@c endfile
|
|
@end example
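
The bookkeeping in this rule can be tried on its own. The following
sketch (the two-line input is invented for illustration) accumulates the
same three counters and prints them at the end. The 14 characters are
the 7 and 5 characters of the two records, plus one newline for each:

@example
$ printf 'one two\nthree\n' |
> awk '@{ chars += length($0) + 1; lines++; words += NF @}
>      END @{ print lines, words, chars @}'
@print{} 2 3 14
@end example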
|
|
|
|
Finally, the @code{END} rule simply prints the totals for all the files:
|
|
|
|
@example
|
|
@c file eg/prog/wc.awk
|
|
END @{
    if (print_total) @{
        if (do_lines)
            printf "\t%d", tlines
        if (do_words)
            printf "\t%d", twords
        if (do_chars)
            printf "\t%d", tchars
        print "\ttotal"
    @}
@}
|
|
@c endfile
|
|
@end example
|
|
@c ENDOFRANGE count
|
|
@c ENDOFRANGE infco
|
|
@c ENDOFRANGE lico
|
|
@c ENDOFRANGE woco
|
|
@c ENDOFRANGE chco
|
|
@c ENDOFRANGE posimawk
|
|
|
|
@node Miscellaneous Programs
|
|
@section A Grab Bag of @command{awk} Programs
|
|
|
|
This @value{SECTION} is a large ``grab bag'' of miscellaneous programs.
|
|
We hope you find them both interesting and enjoyable.
|
|
|
|
@menu
|
|
* Dupword Program:: Finding duplicated words in a document.
|
|
* Alarm Program:: An alarm clock.
|
|
* Translate Program:: A program similar to the @command{tr} utility.
|
|
* Labels Program:: Printing mailing labels.
|
|
* Word Sorting:: A program to produce a word usage count.
|
|
* History Sorting:: Eliminating duplicate entries from a history
|
|
file.
|
|
* Extract Program:: Pulling out programs from Texinfo source
|
|
files.
|
|
* Simple Sed:: A Simple Stream Editor.
|
|
* Igawk Program:: A wrapper for @command{awk} that includes
|
|
files.
|
|
@end menu
|
|
|
|
@node Dupword Program
|
|
@subsection Finding Duplicated Words in a Document
|
|
|
|
@c last comma is part of secondary
|
|
@cindex words, duplicate, searching for
|
|
@cindex searching, for words
|
|
@c first comma is part of primary
|
|
@cindex documents, searching
|
|
A common error when writing large amounts of prose is to accidentally
|
|
duplicate words. Typically you will see this in text as something like ``the
|
|
the program does the following@dots{}'' When the text is online, often
|
|
the duplicated words occur at the end of one line and the beginning of
|
|
another, making them very difficult to spot.
|
|
@c as here!
|
|
|
|
This program, @file{dupword.awk}, scans through a file one line at a time
|
|
and looks for adjacent occurrences of the same word. It also saves the last
|
|
word on a line (in the variable @code{prev}) for comparison with the first
|
|
word on the next line.
|
|
|
|
@cindex Texinfo
|
|
The first statement makes sure that the line is all lowercase,
|
|
so that, for example, ``The'' and ``the'' compare equal to each other.
|
|
The next statement replaces nonalphanumeric and nonwhitespace characters
|
|
with spaces, so that punctuation does not affect the comparison either.
|
|
The characters are replaced with spaces so that formatting controls
|
|
don't create nonsense words (e.g., the Texinfo @samp{@@code@{NF@}}
|
|
becomes @samp{codeNF} if punctuation is simply deleted). The record is
|
|
then resplit into fields, yielding just the actual words on the line,
|
|
and ensuring that there are no empty fields.
|
|
|
|
If there are no fields left after removing all the punctuation, the
|
|
current record is skipped. Otherwise, the program loops through each
|
|
word, comparing it to the previous one:
|
|
|
|
@cindex @code{dupword.awk} program
|
|
@example
|
|
@c file eg/prog/dupword.awk
|
|
# dupword.awk --- find duplicate words in text
|
|
@c endfile
|
|
@ignore
|
|
@c file eg/prog/dupword.awk
|
|
#
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# December 1991
|
|
# Revised October 2000
|
|
|
|
@c endfile
|
|
@end ignore
|
|
@c file eg/prog/dupword.awk
|
|
@{
    $0 = tolower($0)
    gsub(/[^[:alnum:][:blank:]]/, " ");
    $0 = $0         # re-split
    if (NF == 0)
        next
    if ($1 == prev)
        printf("%s:%d: duplicate %s\n",
            FILENAME, FNR, $1)
    for (i = 2; i <= NF; i++)
        if ($i == $(i-1))
            printf("%s:%d: duplicate %s\n",
                FILENAME, FNR, $i)
    prev = $NF
@}
|
|
@c endfile
|
|
@end example
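
To see what the normalization steps do to a line that contains Texinfo
markup, here is a short sketch on an invented input line. It assumes an
@command{awk} that supports POSIX character classes, as @command{gawk}
does. The assignment @samp{$1 = $1} is added only to make the output
easier to read and is not part of @file{dupword.awk}:

@example
$ echo 'The @@code@{NF@} value.' |
> awk '@{ $0 = tolower($0)
>        gsub(/[^[:alnum:][:blank:]]/, " ")
>        $1 = $1    # rebuild $0 with single spaces
>        print NF ": " $0 @}'
@print{} 4: the code nf value
@end example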
|
|
|
|
@node Alarm Program
|
|
@subsection An Alarm Clock Program
|
|
@cindex insomnia, cure for
|
|
@cindex Robbins, Arnold
|
|
@quotation
|
|
@i{Nothing cures insomnia like a ringing alarm clock.}@*
|
|
Arnold Robbins
|
|
@end quotation
|
|
|
|
@c STARTOFRANGE tialarm
|
|
@cindex time, alarm clock example program
|
|
@c STARTOFRANGE alaex
|
|
@cindex alarm clock example program
|
|
The following program is a simple ``alarm clock'' program.
|
|
You give it a time of day and an optional message. At the specified time,
|
|
it prints the message on the standard output. In addition, you can give it
|
|
the number of times to repeat the message as well as a delay between
|
|
repetitions.
|
|
|
|
This program uses the @code{gettimeofday} function from
|
|
@ref{Gettimeofday Function}.
|
|
|
|
All the work is done in the @code{BEGIN} rule. The first part is argument
|
|
checking and setting of defaults: the delay, the count, and the message to
|
|
print. If the user supplied a message without the ASCII BEL
|
|
character (known as the ``alert'' character, @code{"\a"}), then it is added to
|
|
the message. (On many systems, printing the ASCII BEL generates an
|
|
audible alert. Thus when the alarm goes off, the system calls attention
|
|
to itself in case the user is not looking at the computer or terminal.)
|
|
Here is the program:
|
|
|
|
@cindex @code{alarm.awk} program
|
|
@example
|
|
@c file eg/prog/alarm.awk
|
|
# alarm.awk --- set an alarm
|
|
#
|
|
# Requires gettimeofday library function
|
|
@c endfile
|
|
@ignore
|
|
@c file eg/prog/alarm.awk
|
|
#
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# May 1993
|
|
|
|
@c endfile
|
|
@end ignore
|
|
@c file eg/prog/alarm.awk
|
|
# usage: alarm time [ "message" [ count [ delay ] ] ]
|
|
|
|
BEGIN    \
@{
    # Initial argument sanity checking
    usage1 = "usage: alarm time ['message' [count [delay]]]"
    usage2 = sprintf("\t(%s) time ::= hh:mm", ARGV[1])

    if (ARGC < 2) @{
        print usage1 > "/dev/stderr"
        print usage2 > "/dev/stderr"
        exit 1
    @} else if (ARGC == 5) @{
        delay = ARGV[4] + 0
        count = ARGV[3] + 0
        message = ARGV[2]
    @} else if (ARGC == 4) @{
        count = ARGV[3] + 0
        message = ARGV[2]
    @} else if (ARGC == 3) @{
        message = ARGV[2]
    @} else if (ARGV[1] !~ /[0-9]?[0-9]:[0-9][0-9]/) @{
        print usage1 > "/dev/stderr"
        print usage2 > "/dev/stderr"
        exit 1
    @}

    # set defaults for once we reach the desired time
    if (delay == 0)
        delay = 180    # 3 minutes
@group
    if (count == 0)
        count = 5
@end group
    if (message == "")
        message = sprintf("\aIt is now %s!\a", ARGV[1])
    else if (index(message, "\a") == 0)
        message = "\a" message "\a"
|
|
@c endfile
|
|
@end example
|
|
|
|
The next part of the code turns the alarm time into hours and minutes,
|
|
converts it (if necessary) to a 24-hour clock, and then turns that
|
|
time into a count of the seconds since midnight. Next it turns the current
|
|
time into a count of seconds since midnight. The difference between the two
|
|
is how long to wait before setting off the alarm:
|
|
|
|
@example
|
|
@c file eg/prog/alarm.awk
|
|
    # split up alarm time
    split(ARGV[1], atime, ":")
    hour = atime[1] + 0    # force numeric
    minute = atime[2] + 0  # force numeric

    # get current broken down time
    gettimeofday(now)

    # if time given is 12-hour hours and it's after that
    # hour, e.g., `alarm 5:30' at 9 a.m. means 5:30 p.m.,
    # then add 12 to real hour
    if (hour < 12 && now["hour"] > hour)
        hour += 12

    # set target time in seconds since midnight
    target = (hour * 60 * 60) + (minute * 60)

    # get current time in seconds since midnight
    current = (now["hour"] * 60 * 60) + \
               (now["minute"] * 60) + now["second"]

    # how long to sleep for
    naptime = target - current
    if (naptime <= 0) @{
        print "time is in the past!" > "/dev/stderr"
        exit 1
    @}
|
|
@c endfile
|
|
@end example
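
The conversion to seconds since midnight is easy to check by hand. This
sketch uses a fixed, made-up alarm time of @samp{17:30} instead of
@code{ARGV[1]}, and prints the value that @code{target} would receive:
17 hours is 61200 seconds and 30 minutes is 1800 seconds.

@example
$ awk 'BEGIN @{
>     split("17:30", atime, ":")
>     target = (atime[1] * 60 * 60) + (atime[2] * 60)
>     print target
> @}'
@print{} 63000
@end example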
|
|
|
|
@cindex @command{sleep} utility
|
|
Finally, the program uses the @code{system} function
|
|
(@pxref{I/O Functions})
|
|
to call the @command{sleep} utility. The @command{sleep} utility simply pauses
|
|
for the given number of seconds. If the exit status is not zero,
|
|
the program assumes that @command{sleep} was interrupted and exits. If
|
|
@command{sleep} exited with an OK status (zero), then the program prints the
|
|
message in a loop, again using @command{sleep} to delay for however many
|
|
seconds are necessary:
|
|
|
|
@example
|
|
@c file eg/prog/alarm.awk
|
|
    # zzzzzz..... go away if interrupted
    if (system(sprintf("sleep %d", naptime)) != 0)
        exit 1

    # time to notify!
    command = sprintf("sleep %d", delay)
    for (i = 1; i <= count; i++) @{
        print message
        # if sleep command interrupted, go away
        if (system(command) != 0)
            break
    @}

    exit 0
@}
|
|
@c endfile
|
|
@end example
|
|
@c ENDOFRANGE tialarm
|
|
@c ENDOFRANGE alaex
|
|
|
|
@node Translate Program
|
|
@subsection Transliterating Characters
|
|
|
|
@c STARTOFRANGE chtra
|
|
@cindex characters, transliterating
|
|
@cindex @command{tr} utility
|
|
The system @command{tr} utility transliterates characters. For example, it is
|
|
often used to map uppercase letters into lowercase for further processing:
|
|
|
|
@example
|
|
@var{generate data} | tr 'A-Z' 'a-z' | @var{process data} @dots{}
|
|
@end example
|
|
|
|
@command{tr} requires two lists of characters.@footnote{On some older
|
|
System V systems,
|
|
@ifset ORA
|
|
including Solaris,
|
|
@end ifset
|
|
@command{tr} may require that the lists be written as
|
|
range expressions enclosed in square brackets (@samp{[a-z]}) and quoted,
|
|
to prevent the shell from attempting a @value{FN} expansion. This is
|
|
not a feature.} When processing the input, the first character in the
|
|
first list is replaced with the first character in the second list,
|
|
the second character in the first list is replaced with the second
|
|
character in the second list, and so on. If there are more characters
|
|
in the ``from'' list than in the ``to'' list, the last character of the
|
|
``to'' list is used for the remaining characters in the ``from'' list.
|
|
|
|
Some time ago,
|
|
@c early or mid-1989!
|
|
a user proposed that a transliteration function should
|
|
be added to @command{gawk}.
|
|
@c Wishing to avoid gratuitous new features,
|
|
@c at least theoretically
|
|
The following program was written to
|
|
prove that character transliteration could be done with a user-level
|
|
function. This program is not as complete as the system @command{tr} utility
|
|
but it does most of the job.
|
|
|
|
The @command{translate} program demonstrates one of the few weaknesses
|
|
of standard @command{awk}: dealing with individual characters is very
|
|
painful, requiring repeated use of the @code{substr}, @code{index},
|
|
and @code{gsub} built-in functions
|
|
(@pxref{String Functions}).@footnote{This
|
|
program was written before @command{gawk} acquired the ability to
|
|
split each character in a string into separate array elements.}
|
|
@c Exercise: How might you use this new feature to simplify the program?
|
|
There are two functions. The first, @code{stranslate}, takes three
|
|
arguments:
|
|
|
|
@table @code
|
|
@item from
|
|
A list of characters from which to translate.
|
|
|
|
@item to
|
|
A list of characters to which to translate.
|
|
|
|
@item target
|
|
The string on which to do the translation.
|
|
@end table
|
|
|
|
Associative arrays make the translation part fairly easy. @code{t_ar} holds
|
|
the ``to'' characters, indexed by the ``from'' characters. Then a simple
|
|
loop goes through @code{from}, one character at a time. For each character
|
|
in @code{from}, if the character appears in @code{target}, @code{gsub}
|
|
is used to change it to the corresponding @code{to} character.
|
|
|
|
The @code{translate} function simply calls @code{stranslate} using @code{$0}
|
|
as the target. The main program sets two global variables, @code{FROM} and
|
|
@code{TO}, from the command line, and then changes @code{ARGV} so that
|
|
@command{awk} reads from the standard input.
|
|
|
|
Finally, the processing rule simply calls @code{translate} for each record:
|
|
|
|
@cindex @code{translate.awk} program
|
|
@example
|
|
@c file eg/prog/translate.awk
|
|
# translate.awk --- do tr-like stuff
|
|
@c endfile
|
|
@ignore
|
|
@c file eg/prog/translate.awk
|
|
#
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# August 1989
|
|
|
|
@c endfile
|
|
@end ignore
|
|
@c file eg/prog/translate.awk
|
|
# Bugs: does not handle things like: tr A-Z a-z, it has
# to be spelled out. However, if `to' is shorter than `from',
# the last character in `to' is used for the rest of `from'.

function stranslate(from, to, target,     lf, lt, t_ar, i, c)
@{
    lf = length(from)
    lt = length(to)
    for (i = 1; i <= lt; i++)
        t_ar[substr(from, i, 1)] = substr(to, i, 1)
    if (lt < lf)
        for (; i <= lf; i++)
            t_ar[substr(from, i, 1)] = substr(to, lt, 1)
    for (i = 1; i <= lf; i++) @{
        c = substr(from, i, 1)
        if (index(target, c) > 0)
            gsub(c, t_ar[c], target)
    @}
    return target
@}

function translate(from, to)
@{
    return $0 = stranslate(from, to, $0)
@}

# main program
BEGIN @{
@group
    if (ARGC < 3) @{
        print "usage: translate from to" > "/dev/stderr"
        exit
    @}
@end group
    FROM = ARGV[1]
    TO = ARGV[2]
    ARGC = 2
    ARGV[1] = "-"
@}

@{
    translate(FROM, TO)
    print
@}
|
|
@c endfile
|
|
@end example
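
One caveat is worth knowing about when using this approach. It is an
observation on the technique, shown here with invented input:
@code{gsub} treats its first argument as a regexp, so a ``from''
character that is a regexp metacharacter, such as @samp{.}, matches more
than itself. The second command shows one possible workaround, a bracket
expression; this is only a suggestion and not something the program
above does:

@example
$ echo "a.b" | awk '@{ gsub(".", "x"); print @}'
@print{} xxx
$ echo "a.b" | awk '@{ gsub("[.]", "x"); print @}'
@print{} axb
@end example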
|
|
|
|
While it is possible to do character transliteration in a user-level
|
|
function, it is not necessarily efficient, and we (the @command{gawk}
|
|
authors) started to consider adding a built-in function. However,
|
|
shortly after writing this program, we learned that the System V Release 4
|
|
@command{awk} had added the @code{toupper} and @code{tolower} functions
|
|
(@pxref{String Functions}).
|
|
These functions handle the vast majority of the
|
|
cases where character transliteration is necessary, and so we chose to
|
|
simply add those functions to @command{gawk} as well and then leave well
|
|
enough alone.
|
|
|
|
An obvious improvement to this program would be to set up the
|
|
@code{t_ar} array only once, in a @code{BEGIN} rule. However, this
|
|
assumes that the ``from'' and ``to'' lists
|
|
will never change throughout the lifetime of the program.
|
|
@c ENDOFRANGE chtra
|
|
|
|
@node Labels Program
|
|
@subsection Printing Mailing Labels
|
|
|
|
@c STARTOFRANGE prml
|
|
@cindex printing, mailing labels
|
|
@c comma is part of primary
|
|
@c STARTOFRANGE mlprint
|
|
@cindex mailing labels, printing
|
|
Here is a ``real world''@footnote{``Real world'' is defined as
|
|
``a program actually used to get something done.''}
|
|
program. This
|
|
script reads lists of names and
|
|
addresses and generates mailing labels. Each page of labels has 20 labels
|
|
on it, 2 across and 10 down. The addresses are guaranteed to be no more
|
|
than 5 lines of data. Each address is separated from the next by a blank
|
|
line.
|
|
|
|
The basic idea is to read 20 labels worth of data. Each line of each label
|
|
is stored in the @code{line} array. The single rule takes care of filling
|
|
the @code{line} array and printing the page when 20 labels have been read.
|
|
|
|
The @code{BEGIN} rule simply sets @code{RS} to the empty string, so that
|
|
@command{awk} splits records at blank lines
|
|
(@pxref{Records}).
|
|
It sets @code{MAXLINES} to 100, since 100 is the maximum number
|
|
of lines on the page (20 * 5 = 100).
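
The effect of the empty @code{RS} can be seen with a tiny, made-up input
of two ``addresses'' separated by a blank line. In this sketch, each
record is one whole address, and @code{split} with a newline separator
recovers its individual lines; the command prints the record number and
the number of lines found in it:

@example
$ printf 'A\nB\n\nC\n' |
> awk 'BEGIN @{ RS = "" @}
>      @{ n = split($0, a, "\n"); print NR, n @}'
@print{} 1 2
@print{} 2 1
@end example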
|
|
|
|
Most of the work is done in the @code{printpage} function.
|
|
The label lines are stored sequentially in the @code{line} array. But they
|
|
have to print horizontally; @code{line[1]} next to @code{line[6]},
|
|
@code{line[2]} next to @code{line[7]}, and so on. Two loops are used to
|
|
accomplish this. The outer loop, controlled by @code{i}, steps through
|
|
every 10 lines of data; this is each row of labels. The inner loop,
|
|
controlled by @code{j}, goes through the lines within the row.
|
|
As @code{j} goes from 0 to 4, @samp{i+j} is the index of the current line
of the left-hand label, and @samp{i+j+5} is the corresponding line of the
label next to it. The output ends up looking something like this:
|
|
|
|
@example
|
|
line 1 line 6
|
|
line 2 line 7
|
|
line 3 line 8
|
|
line 4 line 9
|
|
line 5 line 10
|
|
@dots{}
|
|
@end example
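
The index arithmetic can be checked in isolation. The following sketch
runs the same two loops over the first 20 entries only (the limit of 20
is chosen just to keep the output short) and prints each pair of
subscripts that would share an output line:

@example
$ awk 'BEGIN @{
>     for (i = 1; i <= 20; i += 10)
>         for (j = 0; j < 5; j++)
>             printf "%d,%d ", i + j, i + j + 5
>     print ""
> @}'
@print{} 1,6 2,7 3,8 4,9 5,10 11,16 12,17 13,18 14,19 15,20 
@end example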
|
|
|
|
As a final note, an extra blank line is printed at lines 21 and 61, to keep
|
|
the output lined up on the labels. This is dependent on the particular
|
|
brand of labels in use when the program was written. You will also note
|
|
that there are 2 blank lines at the top and 2 blank lines at the bottom.
|
|
|
|
The @code{END} rule arranges to flush the final page of labels; there may
|
|
not have been an even multiple of 20 labels in the data:
|
|
|
|
@cindex @code{labels.awk} program
|
|
@example
|
|
@c file eg/prog/labels.awk
|
|
# labels.awk --- print mailing labels
|
|
@c endfile
|
|
@ignore
|
|
@c file eg/prog/labels.awk
|
|
#
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# June 1992
|
|
@c endfile
|
|
@end ignore
|
|
@c file eg/prog/labels.awk
|
|
|
|
# Each label is 5 lines of data that may have blank lines.
|
|
# The label sheets have 2 blank lines at the top and 2 at
|
|
# the bottom.
|
|
|
|
BEGIN    @{ RS = "" ; MAXLINES = 100 @}

function printpage(    i, j)
@{
    if (Nlines <= 0)
        return

    printf "\n\n"        # header

    for (i = 1; i <= Nlines; i += 10) @{
        if (i == 21 || i == 61)
            print ""
        for (j = 0; j < 5; j++) @{
            if (i + j > MAXLINES)
                break
            printf " %-41s %s\n", line[i+j], line[i+j+5]
        @}
        print ""
    @}

    printf "\n\n"        # footer

    for (i in line)
        line[i] = ""
@}

# main rule
@{
    if (Count >= 20) @{
        printpage()
        Count = 0
        Nlines = 0
    @}
    n = split($0, a, "\n")
    for (i = 1; i <= n; i++)
        line[++Nlines] = a[i]
    for (; i <= 5; i++)
        line[++Nlines] = ""
    Count++
@}

END    \
@{
    printpage()
@}
|
|
@c endfile
|
|
@end example
|
|
@c ENDOFRANGE prml
|
|
@c ENDOFRANGE mlprint
|
|
|
|
@node Word Sorting
|
|
@subsection Generating Word-Usage Counts
|
|
|
|
@c last comma is part of secondary
|
|
@c STARTOFRANGE worus
|
|
@cindex words, usage counts, generating
|
|
@c NEXT ED: Rewrite this whole section and example
|
|
The following @command{awk} program prints
|
|
the number of occurrences of each word in its input. It illustrates the
|
|
associative nature of @command{awk} arrays by using strings as subscripts. It
|
|
also demonstrates the @samp{for @var{index} in @var{array}} mechanism.
|
|
Finally, it shows how @command{awk} is used in conjunction with other
|
|
utility programs to do a useful task of some complexity with a minimum of
|
|
effort. Some explanations follow the program listing:
|
|
|
|
@example
|
|
# Print list of word frequencies
|
|
@{
    for (i = 1; i <= NF; i++)
        freq[$i]++
@}

END @{
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
@}
|
|
@end example
|
|
|
|
@c Exercise: Use asort() here
|
|
|
|
This program has two rules. The
|
|
first rule, because it has an empty pattern, is executed for every input line.
|
|
It uses @command{awk}'s field-accessing mechanism
|
|
(@pxref{Fields}) to pick out the individual words from
|
|
the line, and the built-in variable @code{NF} (@pxref{Built-in Variables})
|
|
to know how many fields are available.
|
|
For each input word, it increments an element of the array @code{freq} to
|
|
reflect that the word has been seen an additional time.
|
|
|
|
The second rule, because it has the pattern @code{END}, is not executed
|
|
until the input has been exhausted. It prints out the contents of the
|
|
@code{freq} table that has been built up inside the first action.
|
|
This program has several problems that would prevent it from being
|
|
useful by itself on real text files:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
Words are detected using the @command{awk} convention that fields are
|
|
separated just by whitespace. Other characters in the input (except
|
|
newlines) don't have any special meaning to @command{awk}. This means that
|
|
punctuation characters count as part of words.
|
|
|
|
@item
|
|
The @command{awk} language considers upper- and lowercase characters to be
|
|
distinct. Therefore, ``bartender'' and ``Bartender'' are not treated
|
|
as the same word. This is undesirable, since in normal text, words
|
|
are capitalized if they begin sentences, and a frequency analyzer should not
|
|
be sensitive to capitalization.
|
|
|
|
@item
|
|
The output does not come out in any useful order. You're more likely to be
|
|
interested in which words occur most frequently or in having an alphabetized
|
|
table of how frequently each word occurs.
|
|
@end itemize
|
|
|
|
@cindex @command{sort} utility
|
|
The way to solve these problems is to use some of @command{awk}'s more advanced
|
|
features. First, we use @code{tolower} to remove
|
|
case distinctions. Next, we use @code{gsub} to remove punctuation
|
|
characters. Finally, we use the system @command{sort} utility to process the
|
|
output of the @command{awk} script. Here is the new version of
|
|
the program:
|
|
|
|
@cindex @code{wordfreq.awk} program
|
|
@example
|
|
@c file eg/prog/wordfreq.awk
|
|
# wordfreq.awk --- print list of word frequencies
|
|
|
|
@{
    $0 = tolower($0)    # remove case distinctions
    # remove punctuation
    gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
    for (i = 1; i <= NF; i++)
        freq[$i]++
@}

END @{
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
Assuming we have saved this program in a file named @file{wordfreq.awk},
|
|
and that the data is in @file{file1}, the following pipeline:
|
|
|
|
@example
|
|
awk -f wordfreq.awk file1 | sort -k 2nr
|
|
@end example
|
|
|
|
@noindent
|
|
produces a table of the words appearing in @file{file1} in order of
|
|
decreasing frequency. The @command{awk} program suitably massages the
|
|
data and produces a word frequency table, which is not ordered.
|
|
|
|
The @command{awk} script's output is then sorted by the @command{sort}
|
|
utility and printed on the terminal. The options given to @command{sort}
|
|
specify a sort that uses the second field of each input line (skipping
|
|
one field), that the sort keys should be treated as numeric quantities
|
|
(otherwise @samp{15} would come before @samp{5}), and that the sorting
|
|
should be done in descending (reverse) order.
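
To try the whole pipeline without creating any files, the following
sketch feeds two invented lines to a cut-down version of the program
(the punctuation step is left out because the sample input has none).
The counts are deliberately all different, so the order of the sorted
output does not depend on how @command{sort} breaks ties:

@example
$ printf 'b a b\nB a c\n' |
> awk '@{ $0 = tolower($0)
>        for (i = 1; i <= NF; i++) freq[$i]++ @}
>      END @{ for (w in freq) printf "%s\t%d\n", w, freq[w] @}' |
> sort -k 2nr
@print{} b       3
@print{} a       2
@print{} c       1
@end example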
|
|
|
|
The @command{sort} could even be done from within the program, by changing
|
|
the @code{END} action to:
|
|
|
|
@example
|
|
@c file eg/prog/wordfreq.awk
|
|
END @{
    sort = "sort -k 2nr"
    for (word in freq)
        printf "%s\t%d\n", word, freq[word] | sort
    close(sort)
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
This way of sorting must be used on systems that do not
|
|
have true pipes at the command-line (or batch-file) level.
|
|
See the general operating system documentation for more information on how
|
|
to use the @command{sort} program.
|
|
@c ENDOFRANGE worus
|
|
|
|
@node History Sorting
|
|
@subsection Removing Duplicates from Unsorted Text
|
|
|
|
@c last comma is part of secondary
|
|
@c STARTOFRANGE lidu
|
|
@cindex lines, duplicate, removing
|
|
The @command{uniq} program
|
|
(@pxref{Uniq Program})
|
|
removes duplicate lines from @emph{sorted} data.
|
|
|
|
Suppose, however, you need to remove duplicate lines from a @value{DF} but
|
|
that you want to preserve the order the lines are in. A good example of
|
|
this might be a shell history file. The history file keeps a copy of all
|
|
the commands you have entered, and it is not unusual to repeat a command
|
|
several times in a row. Occasionally you might want to compact the history
|
|
by removing duplicate entries. Yet it is desirable to maintain the order
|
|
of the original commands.
|
|
|
|
This simple program does the job. It uses two arrays. The @code{data}
|
|
array is indexed by the text of each line.
|
|
For each line, @code{data[$0]} is incremented.
|
|
If a particular line has not
|
|
been seen before, then @code{data[$0]} is zero.
|
|
In this case, the text of the line is stored in @code{lines[count]}.
|
|
Each element of @code{lines} is a unique command, and the indices of
|
|
@code{lines} indicate the order in which those lines are encountered.
|
|
The @code{END} rule simply prints out the lines, in order:
|
|
|
|
@cindex Rakitzis, Byron
|
|
@cindex @code{histsort.awk} program
|
|
@example
|
|
@c file eg/prog/histsort.awk
|
|
# histsort.awk --- compact a shell history file
|
|
# Thanks to Byron Rakitzis for the general idea
|
|
@c endfile
|
|
@ignore
|
|
@c file eg/prog/histsort.awk
|
|
#
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# May 1993
|
|
|
|
@c endfile
|
|
@end ignore
|
|
@c file eg/prog/histsort.awk
|
|
@group
|
|
@{
    if (data[$0]++ == 0)
        lines[++count] = $0
@}
|
|
@end group
|
|
|
|
END @{
    for (i = 1; i <= count; i++)
        print lines[i]
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
This program also provides a foundation for generating other useful
|
|
information. For example, using the following @code{print} statement in the
|
|
@code{END} rule indicates how often a particular command is used:
|
|
|
|
@example
|
|
print data[lines[i]], lines[i]
|
|
@end example
|
|
|
|
This works because @code{data[$0]} is incremented each time a line is
|
|
seen.
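
As a minimal sketch of that variation, the following shell function wraps
the same two-array logic and prints each distinct line once, in first-seen
order, preceded by its count. The function name @code{histcount} and the
sample input are made up for illustration; they are not part of
@file{histsort.awk}.

@verbatim
```shell
# histcount: hypothetical wrapper name. Prints each distinct input line
# once, in first-seen order, prefixed by the number of times it occurred.
histcount() {
    awk '
    {
        if (data[$0]++ == 0)        # first time this line is seen
            lines[++count] = $0     # remember its position
    }
    END {
        for (i = 1; i <= count; i++)
            print data[lines[i]], lines[i]
    }' "$@"
}

printf 'ls\ncd /tmp\nls\nmake\nls\n' | histcount
# prints:
# 3 ls
# 1 cd /tmp
# 1 make
```
@end verbatim

Because the counts live in @code{data} and the ordering lives in
@code{lines}, no sorting step is needed to keep the original sequence.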
|
|
@c ENDOFRANGE lidu
|
|
|
|
@node Extract Program
|
|
@subsection Extracting Programs from Texinfo Source Files
|
|
|
|
@c STARTOFRANGE texse
|
|
@cindex Texinfo, extracting programs from source files
|
|
@c last comma is part of secondary
|
|
@c STARTOFRANGE fitex
|
|
@cindex files, Texinfo, extracting programs from
|
|
@ifnotinfo
|
|
Both this chapter and the previous chapter
|
|
(@ref{Library Functions})
|
|
present a large number of @command{awk} programs.
|
|
@end ifnotinfo
|
|
@ifinfo
|
|
The nodes
|
|
@ref{Library Functions},
|
|
and @ref{Sample Programs},
|
|
are the top level nodes for a large number of @command{awk} programs.
|
|
@end ifinfo
|
|
If you want to experiment with these programs, it is tedious to have to type
|
|
them in by hand. Here we present a program that can extract parts of a
|
|
Texinfo input file into separate files.
|
|
|
|
@cindex Texinfo
|
|
This @value{DOCUMENT} is written in Texinfo, the GNU project's document
|
|
formatting
|
|
language.
|
|
A single Texinfo source file can be used to produce both
|
|
printed and online documentation.
|
|
@ifnotinfo
|
|
Texinfo is fully documented in the book
|
|
@cite{Texinfo---The GNU Documentation Format},
|
|
available from the Free Software Foundation.
|
|
@end ifnotinfo
|
|
@ifinfo
|
|
The Texinfo language is described fully, starting with
|
|
@ref{Top}.
|
|
@end ifinfo
|
|
|
|
For our purposes, it is enough to know three things about Texinfo input
|
|
files:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
The ``at'' symbol (@samp{@@}) is special in Texinfo, much as
|
|
the backslash (@samp{\}) is in C
|
|
or @command{awk}. Literal @samp{@@} symbols are represented in Texinfo source
|
|
files as @samp{@@@@}.
|
|
|
|
@item
|
|
Comments start with either @samp{@@c} or @samp{@@comment}.
|
|
The file-extraction program works by using special comments that start
|
|
at the beginning of a line.
|
|
|
|
@item
|
|
Lines containing @samp{@@group} and @samp{@@end group} commands bracket
|
|
example text that should not be split across a page boundary.
|
|
(Unfortunately, @TeX{} isn't always smart enough to do things exactly right,
|
|
and we have to give it some help.)
|
|
@end itemize
|
|
|
|
The following program, @file{extract.awk}, reads through a Texinfo source
|
|
file and does two things, based on the special comments.
|
|
Upon seeing @samp{@w{@@c system @dots{}}},
|
|
it runs a command, by extracting the command text from the
|
|
control line and passing it on to the @code{system} function
|
|
(@pxref{I/O Functions}).
|
|
Upon seeing @samp{@@c file @var{filename}}, each subsequent line is sent to
|
|
the file @var{filename}, until @samp{@@c endfile} is encountered.
|
|
The rules in @file{extract.awk} match either @samp{@@c} or
|
|
@samp{@@comment} by letting the @samp{omment} part be optional.
|
|
Lines containing @samp{@@group} and @samp{@@end group} are simply removed.
|
|
@file{extract.awk} uses the @code{join} library function
|
|
(@pxref{Join Function}).
|
|
|
|
The example programs in the online Texinfo source for @cite{@value{TITLE}}
|
|
(@file{gawk.texi}) have all been bracketed inside @samp{file} and
|
|
@samp{endfile} lines. The @command{gawk} distribution uses a copy of
|
|
@file{extract.awk} to extract the sample programs and install many
|
|
of them in a standard directory where @command{gawk} can find them.
|
|
The Texinfo file looks something like this:
|
|
|
|
@example
|
|
@dots{}
|
|
This program has a @@code@{BEGIN@} rule,
|
|
that prints a nice message:
|
|
|
|
@@example
|
|
@@c file examples/messages.awk
|
|
BEGIN @@@{ print "Don't panic!" @@@}
|
|
@@c endfile
|
|
@@end example
|
|
|
|
It also prints some final advice:
|
|
|
|
@@example
|
|
@@c file examples/messages.awk
|
|
END @@@{ print "Always avoid bored archeologists!" @@@}
|
|
@@c endfile
|
|
@@end example
|
|
@dots{}
|
|
@end example
|
|
|
|
@file{extract.awk} begins by setting @code{IGNORECASE} to one, so that
|
|
mixed upper- and lowercase letters in the directives won't matter.
|
|
|
|
The first rule handles calling @code{system}, checking that a command is
|
|
given (@code{NF} is at least three) and also checking that the command
|
|
exits with a zero exit status, signifying OK:
|
|
|
|
@cindex @code{extract.awk} program
|
|
@example
|
|
@c file eg/prog/extract.awk
|
|
# extract.awk --- extract files and run programs
|
|
# from texinfo files
|
|
@c endfile
|
|
@ignore
|
|
@c file eg/prog/extract.awk
|
|
#
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# May 1993
|
|
# Revised September 2000
|
|
|
|
@c endfile
|
|
@end ignore
|
|
@c file eg/prog/extract.awk
|
|
BEGIN @{ IGNORECASE = 1 @}
|
|
|
|
/^@@c(omment)?[ \t]+system/ \
|
|
@{
|
|
if (NF < 3) @{
|
|
e = (FILENAME ":" FNR)
|
|
e = (e ": badly formed `system' line")
|
|
print e > "/dev/stderr"
|
|
next
|
|
@}
|
|
$1 = ""
|
|
$2 = ""
|
|
stat = system($0)
|
|
if (stat != 0) @{
|
|
e = (FILENAME ":" FNR)
|
|
e = (e ": warning: system returned " stat)
|
|
print e > "/dev/stderr"
|
|
@}
|
|
@}
|
|
@c endfile
|
|
@end example
|
|
|
|
@noindent
|
|
The variable @code{e} is used so that the rule
|
|
fits nicely on the
|
|
@ifnotinfo
|
|
page.
|
|
@end ifnotinfo
|
|
@ifnottex
|
|
screen.
|
|
@end ifnottex
|
|
|
|
The second rule handles moving data into files. It verifies that a
|
|
@value{FN} is given in the directive. If the file named is not the
|
|
current file, then the current file is closed. Keeping the current file
|
|
open until a new file is encountered allows the use of the @samp{>}
|
|
redirection for printing the contents, keeping open file management
|
|
simple.
|
|
|
|
The @samp{for} loop does the work. It reads lines using @code{getline}
|
|
(@pxref{Getline}).
|
|
For an unexpected end of file, it calls the @code{@w{unexpected_eof}}
|
|
function. If the line is an ``endfile'' line, then it breaks out of
|
|
the loop.
|
|
If the line is an @samp{@@group} or @samp{@@end group} line, then it
|
|
ignores it and goes on to the next line.
|
|
Similarly, comments within examples are also ignored.
|
|
|
|
Most of the work is in the following few lines. If the line has no @samp{@@}
|
|
symbols, the program can print it directly.
|
|
Otherwise, each leading @samp{@@} must be stripped off.
|
|
To remove the @samp{@@} symbols, the line is split into separate elements of
|
|
the array @code{a}, using the @code{split} function
|
|
(@pxref{String Functions}).
|
|
The @samp{@@} symbol is used as the separator character.
|
|
Each element of @code{a} that is empty indicates two successive @samp{@@}
|
|
symbols in the original line. For each two empty elements (@samp{@@@@} in
|
|
the original file), we have to add a single @samp{@@} symbol back in.
|
|
|
|
When the processing of the array is finished, @code{join} is called with the
|
|
value of @code{SUBSEP}, to rejoin the pieces back into a single
|
|
line. That line is then printed to the output file:
|
|
|
|
@example
|
|
@c file eg/prog/extract.awk
|
|
/^@@c(omment)?[ \t]+file/ \
|
|
@{
|
|
if (NF != 3) @{
|
|
e = (FILENAME ":" FNR ": badly formed `file' line")
|
|
print e > "/dev/stderr"
|
|
next
|
|
@}
|
|
if ($3 != curfile) @{
|
|
if (curfile != "")
|
|
close(curfile)
|
|
curfile = $3
|
|
@}
|
|
|
|
for (;;) @{
|
|
if ((getline line) <= 0)
|
|
unexpected_eof()
|
|
if (line ~ /^@@c(omment)?[ \t]+endfile/)
|
|
break
|
|
else if (line ~ /^@@(end[ \t]+)?group/)
|
|
continue
|
|
else if (line ~ /^@@c(omment)?[ \t]+/)
|
|
continue
|
|
if (index(line, "@@") == 0) @{
|
|
print line > curfile
|
|
continue
|
|
@}
|
|
n = split(line, a, "@@")
|
|
# if a[1] == "", means leading @@,
|
|
# don't add one back in.
|
|
for (i = 2; i <= n; i++) @{
|
|
if (a[i] == "") @{ # was an @@@@
|
|
a[i] = "@@"
|
|
if (a[i+1] == "")
|
|
i++
|
|
@}
|
|
@}
|
|
print join(a, 1, n, SUBSEP) > curfile
|
|
@}
|
|
@}
|
|
@c endfile
|
|
@end example
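
The @samp{@@}-stripping step can be tried in isolation. The sketch below
applies the same @code{split}-and-rejoin logic to each line of standard
input. The wrapper name @code{unescape_at} is hypothetical, and the
@code{join} function shown is only a minimal stand-in for the library
function, assumed to treat @code{SUBSEP} as ``no separator.''

@verbatim
```shell
# unescape_at: hypothetical name. Turns Texinfo-escaped text back into
# plain text: "@@" becomes "@", and a single "@" (as in "@{") is dropped.
unescape_at() {
    awk '
    # Minimal stand-in for the join library function (assumed interface:
    # join(array, start, end, sep), where SUBSEP means "no separator").
    function join(array, start, end, sep,    result, i)
    {
        if (sep == "")
            sep = " "
        else if (sep == SUBSEP)
            sep = ""
        result = array[start]
        for (i = start + 1; i <= end; i++)
            result = result sep array[i]
        return result
    }

    {
        if (index($0, "@") == 0) {   # nothing to strip
            print
            next
        }
        n = split($0, a, "@")
        for (i = 2; i <= n; i++) {
            if (a[i] == "") {        # empty piece: there was an "@@"
                a[i] = "@"
                if (a[i+1] == "")
                    i++
            }
        }
        print join(a, 1, n, SUBSEP)
    }'
}

printf '%s\n' 'BEGIN @{ print "arnold@@gnu.org" @}' | unescape_at
# prints: BEGIN { print "arnold@gnu.org" }
```
@end verbatim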
|
|
|
|
An important thing to note is the use of the @samp{>} redirection.
|
|
Output done with @samp{>} only opens the file once; it stays open and
|
|
subsequent output is appended to the file
|
|
(@pxref{Redirection}).
|
|
This makes it easy to mix program text and explanatory prose for the same
|
|
sample source file (as has been done here!) without any hassle. The file is
|
|
only closed when a new data @value{FN} is encountered or at the end of the
|
|
input file.
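
The behavior of @samp{>} described above is easy to check with a small
experiment. This is only a sketch: the function name @code{demo_redirect}
is made up, and the output file is whatever name the caller supplies.

@verbatim
```shell
# demo_redirect: hypothetical name. Writes two lines to the file named by
# the first argument using ">" both times, then shows the file contents.
demo_redirect() {
    out=$1
    awk -v out="$out" 'BEGIN {
        print "first"  > out    # opens (and truncates) the file
        print "second" > out    # file is already open: output is appended
        close(out)
    }'
    cat "$out"
}
```
@end verbatim

Calling @samp{demo_redirect /tmp/demo.out} should show both lines. The
second @code{print} does not truncate the file, because the file is still
open from the first one.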
|
|
|
|
Finally, the function @code{@w{unexpected_eof}} prints an appropriate
|
|
error message and then exits.
|
|
The @code{END} rule handles the final cleanup, closing the open file:
|
|
|
|
@c function lb put on same line for page breaking. sigh
|
|
@example
|
|
@c file eg/prog/extract.awk
|
|
@group
|
|
function unexpected_eof() @{
|
|
printf("%s:%d: unexpected EOF or error\n",
|
|
FILENAME, FNR) > "/dev/stderr"
|
|
exit 1
|
|
@}
|
|
@end group
|
|
|
|
END @{
|
|
if (curfile)
|
|
close(curfile)
|
|
@}
|
|
@c endfile
|
|
@end example
|
|
@c ENDOFRANGE texse
|
|
@c ENDOFRANGE fitex
|
|
|
|
@node Simple Sed
|
|
@subsection A Simple Stream Editor
|
|
|
|
@cindex @command{sed} utility
|
|
@cindex stream editors
|
|
The @command{sed} utility is a stream editor, a program that reads a
|
|
stream of data, makes changes to it, and passes it on.
|
|
It is often used to make global changes to a large file or to a stream
|
|
of data generated by a pipeline of commands.
|
|
While @command{sed} is a complicated program in its own right, its most common
|
|
use is to perform global substitutions in the middle of a pipeline:
|
|
|
|
@example
|
|
command1 < orig.data | sed 's/old/new/g' | command2 > result
|
|
@end example
|
|
|
|
Here, @samp{s/old/new/g} tells @command{sed} to look for the regexp
|
|
@samp{old} on each input line and globally replace it with the text
|
|
@samp{new}, i.e., all the occurrences on a line. This is similar to
|
|
@command{awk}'s @code{gsub} function
|
|
(@pxref{String Functions}).
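
For comparison, here is a minimal sketch of the same substitution written
directly with @code{gsub}. The function name @code{awk_subst} is made up
for this illustration, and the pattern and replacement are fixed to match
the @command{sed} example.

@verbatim
```shell
# awk_subst: hypothetical name. Replaces every match of the regexp "old"
# on each input line with the text "new", like sed 's/old/new/g'.
awk_subst() {
    awk '{ gsub(/old/, "new"); print }'
}

printf 'old hat, old shoes\n' | awk_subst
# prints: new hat, new shoes
```
@end verbatim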
|
|
|
|
The following program, @file{awksed.awk}, accepts at least two command-line
|
|
arguments: the pattern to look for and the text to replace it with. Any
|
|
additional arguments are treated as data @value{FN}s to process. If none
|
|
are provided, the standard input is used:
|
|
|
|
@cindex Brennan, Michael
|
|
@cindex @code{awksed.awk} program
|
|
@c @cindex simple stream editor
|
|
@c @cindex stream editor, simple
|
|
@example
|
|
@c file eg/prog/awksed.awk
|
|
# awksed.awk --- do s/foo/bar/g using just print
|
|
# Thanks to Michael Brennan for the idea
|
|
@c endfile
|
|
@ignore
|
|
@c file eg/prog/awksed.awk
|
|
#
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# August 1995
|
|
|
|
@c endfile
|
|
@end ignore
|
|
@c file eg/prog/awksed.awk
|
|
function usage()
|
|
@{
|
|
print "usage: awksed pat repl [files...]" > "/dev/stderr"
|
|
exit 1
|
|
@}
|
|
|
|
BEGIN @{
|
|
# validate arguments
|
|
if (ARGC < 3)
|
|
usage()
|
|
|
|
RS = ARGV[1]
|
|
ORS = ARGV[2]
|
|
|
|
# don't use arguments as files
|
|
ARGV[1] = ARGV[2] = ""
|
|
@}
|
|
|
|
@group
|
|
# look ma, no hands!
|
|
@{
|
|
if (RT == "")
|
|
printf "%s", $0
|
|
else
|
|
print
|
|
@}
|
|
@end group
|
|
@c endfile
|
|
@end example
|
|
|
|
The program relies on @command{gawk}'s ability to have @code{RS} be a regexp,
|
|
as well as on the setting of @code{RT} to the actual text that terminates the
|
|
record (@pxref{Records}).
|
|
|
|
The idea is to have @code{RS} be the pattern to look for. @command{gawk}
|
|
automatically sets @code{$0} to the text between matches of the pattern.
|
|
This is text that we want to keep, unmodified. Then, by setting @code{ORS}
|
|
to the replacement text, a simple @code{print} statement outputs the
|
|
text we want to keep, followed by the replacement text.
|
|
|
|
There is one wrinkle to this scheme, which is what to do if the last record
|
|
doesn't end with text that matches @code{RS}. Using a @code{print}
|
|
statement unconditionally prints the replacement text, which is not correct.
|
|
However, if the file did not end in text that matches @code{RS}, @code{RT}
|
|
is set to the null string. In this case, we can print @code{$0} using
|
|
@code{printf}
|
|
(@pxref{Printf}).
|
|
|
|
The @code{BEGIN} rule handles the setup, checking for the right number
|
|
of arguments and calling @code{usage} if there is a problem. Then it sets
|
|
@code{RS} and @code{ORS} from the command-line arguments and sets
|
|
@code{ARGV[1]} and @code{ARGV[2]} to the null string, so that they are
|
|
not treated as @value{FN}s
|
|
(@pxref{ARGC and ARGV}).
|
|
|
|
The @code{usage} function prints an error message and exits.
|
|
Finally, the single rule handles the printing scheme outlined above,
|
|
using @code{print} or @code{printf} as appropriate, depending upon the
|
|
value of @code{RT}.
|
|
|
|
@ignore
|
|
Exercise, compare the performance of this version with the more
|
|
straightforward:
|
|
|
|
BEGIN {
|
|
pat = ARGV[1]
|
|
repl = ARGV[2]
|
|
ARGV[1] = ARGV[2] = ""
|
|
}
|
|
|
|
{ gsub(pat, repl); print }
|
|
|
|
Exercise: what are the advantages and disadvantages of this version versus sed?
|
|
Advantage: egrep regexps
|
|
speed (?)
|
|
Disadvantage: no & in replacement text
|
|
|
|
Others?
|
|
@end ignore
|
|
|
|
@node Igawk Program
|
|
@subsection An Easy Way to Use Library Functions
|
|
|
|
@c STARTOFRANGE libfex
|
|
@cindex libraries of @command{awk} functions, example program for using
|
|
@c STARTOFRANGE flibex
|
|
@cindex functions, library, example program for using
|
|
Using library functions in @command{awk} can be very beneficial. It
|
|
encourages code reuse and the writing of general functions. Programs are
|
|
smaller and therefore clearer.
|
|
However, using library functions is only easy when writing @command{awk}
|
|
programs; it is painful when running them, requiring multiple @option{-f}
|
|
options. If @command{gawk} is unavailable, then so too is the @env{AWKPATH}
|
|
environment variable and the ability to put @command{awk} functions into a
|
|
library directory (@pxref{Options}).
|
|
It would be nice to be able to write programs in the following manner:
|
|
|
|
@example
|
|
# library functions
|
|
@@include getopt.awk
|
|
@@include join.awk
|
|
@dots{}
|
|
|
|
# main program
|
|
BEGIN @{
|
|
while ((c = getopt(ARGC, ARGV, "a:b:cde")) != -1)
|
|
@dots{}
|
|
@dots{}
|
|
@}
|
|
@end example
|
|
|
|
The following program, @file{igawk.sh}, provides this service.
|
|
It simulates @command{gawk}'s searching of the @env{AWKPATH} variable
|
|
and also allows @dfn{nested} includes; i.e., a file that is included
|
|
with @samp{@@include} can contain further @samp{@@include} statements.
|
|
@command{igawk} makes an effort to only include files once, so that nested
|
|
includes don't accidentally include a library function twice.
|
|
|
|
@command{igawk} should behave just like @command{gawk} externally. This
|
|
means it should accept all of @command{gawk}'s command-line arguments,
|
|
including the ability to have multiple source files specified via
|
|
@option{-f}, and the ability to mix command-line and library source files.
|
|
|
|
The program is written using the POSIX Shell (@command{sh}) command
|
|
language.@footnote{Fully explaining the @command{sh} language is beyond
|
|
the scope of this book. We provide some minimal explanations, but see
|
|
a good shell programming book if you wish to understand things in more
|
|
depth.} It works as follows:
|
|
|
|
@enumerate
|
|
@item
|
|
Loop through the arguments, saving anything that doesn't represent
|
|
@command{awk} source code for later, when the expanded program is run.
|
|
|
|
@item
|
|
For any arguments that do represent @command{awk} text, put the arguments into
|
|
a shell variable that will be expanded. There are two cases:
|
|
|
|
@enumerate a
|
|
@item
|
|
Literal text, provided with @option{--source} or @option{--source=}. This
|
|
text is just appended directly.
|
|
|
|
@item
|
|
Source @value{FN}s, provided with @option{-f}. We use a neat trick and append
|
|
@samp{@@include @var{filename}} to the shell variable's contents. Since the file-inclusion
|
|
program works the way @command{gawk} does, this gets the text
|
|
of the file included into the program at the correct point.
|
|
@end enumerate
|
|
|
|
@item
|
|
Run an @command{awk} program (naturally) over the shell variable's contents to expand
|
|
@samp{@@include} statements. The expanded program is placed in a second
|
|
shell variable.
|
|
|
|
@item
|
|
Run the expanded program with @command{gawk} and any other original command-line
|
|
arguments that the user supplied (such as the data @value{FN}s).
|
|
@end enumerate
|
|
|
|
This program uses shell variables extensively: for storing command-line arguments,
|
|
the text of the @command{awk} program that will expand the user's program, for the
|
|
user's original program, and for the expanded program. Doing so removes some
|
|
potential problems that might arise were we to use temporary files instead,
|
|
at the cost of making the script somewhat more complicated.
|
|
|
|
The initial part of the program turns on shell tracing if the first
|
|
argument is @samp{debug}.
|
|
|
|
The next part loops through all the command-line arguments.
|
|
There are several cases of interest:
|
|
|
|
@table @code
|
|
@item --
|
|
This ends the arguments to @command{igawk}. Anything else should be passed on
|
|
to the user's @command{awk} program without being evaluated.
|
|
|
|
@item -W
|
|
This indicates that the next option is specific to @command{gawk}. To make
|
|
argument processing easier, the @option{-W} is prepended to the
|
|
remaining arguments and the loop continues. (This is an @command{sh}
|
|
programming trick. Don't worry about it if you are not familiar with
|
|
@command{sh}.)
|
|
|
|
@item -v@r{,} -F
|
|
These are saved and passed on to @command{gawk}.
|
|
|
|
@item -f@r{,} --file@r{,} --file=@r{,} -Wfile=
|
|
The @value{FN} is appended to the shell variable @code{program} with an
|
|
@samp{@@include} statement.
|
|
The @command{expr} utility is used to remove the leading option part of the
|
|
argument (e.g., @samp{--file=}).
|
|
(Typical @command{sh} usage would be to use the @command{echo} and @command{sed}
|
|
utilities to do this work. Unfortunately, some versions of @command{echo} evaluate
|
|
escape sequences in their arguments, possibly mangling the program text.
|
|
Using @command{expr} avoids this problem.)
|
|
|
|
@item --source@r{,} --source=@r{,} -Wsource=
|
|
The source text is appended to @code{program}.
|
|
|
|
@item --version@r{,} -Wversion
|
|
@command{igawk} prints its version number, runs @samp{gawk --version}
|
|
to get the @command{gawk} version information, and then exits.
|
|
@end table
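
The @command{expr} idiom mentioned above can be tried on its own. In this
sketch, the function name @code{strip_option} and the file name
@file{lib.awk} are made up; the regular expression is the one used for the
@samp{file=} case.

@verbatim
```shell
# strip_option: hypothetical name. Prints the part of its argument that
# follows "--file=" or "-Wfile=". The expr ":" operator anchors the
# pattern at the start of the string and prints what \(...\) matched.
strip_option() {
    expr "$1" : '-.file=\(.*\)'
}

strip_option --file=lib.awk     # prints: lib.awk
strip_option -Wfile=lib.awk     # prints: lib.awk
```
@end verbatim

The single @samp{.} after the leading dash is what lets one pattern cover
both the @samp{--file=} and @samp{-Wfile=} spellings.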
|
|
|
|
If none of the @option{-f}, @option{--file}, @option{-Wfile}, @option{--source},
|
|
or @option{-Wsource} arguments are supplied, then the first nonoption argument
|
|
should be the @command{awk} program. If there are no command-line
|
|
arguments left, @command{igawk} prints an error message and exits.
|
|
Otherwise, the first argument is appended to @code{program}.
|
|
In any case, after the arguments have been processed,
|
|
@code{program} contains the complete text of the original @command{awk}
|
|
program.
|
|
|
|
The program is as follows:
|
|
|
|
@cindex @code{igawk.sh} program
|
|
@example
|
|
@c file eg/prog/igawk.sh
|
|
#! /bin/sh
|
|
# igawk --- like gawk but do @@include processing
|
|
@c endfile
|
|
@ignore
|
|
@c file eg/prog/igawk.sh
|
|
#
|
|
# Arnold Robbins, arnold@@gnu.org, Public Domain
|
|
# July 1993
|
|
|
|
@c endfile
|
|
@end ignore
|
|
@c file eg/prog/igawk.sh
|
|
if [ "$1" = debug ]
|
|
then
|
|
set -x
|
|
shift
|
|
fi
|
|
|
|
# A literal newline, so that program text is formatted correctly
|
|
n='
|
|
'
|
|
|
|
# Initialize variables to empty
|
|
program=
|
|
opts=
|
|
|
|
while [ $# -ne 0 ] # loop over arguments
|
|
do
|
|
case $1 in
|
|
--) shift; break;;
|
|
|
|
-W) shift
|
|
# The $@{x?'message here'@} construct prints a
|
|
# diagnostic and exits if $x is unset
|
|
set -- -W"$@{@@?'missing operand'@}"
|
|
continue;;
|
|
|
|
-[vF]) opts="$opts $1 '$@{2?'missing operand'@}'"
|
|
shift;;
|
|
|
|
-[vF]*) opts="$opts '$1'" ;;
|
|
|
|
-f) program="$program$n@@include $@{2?'missing operand'@}"
|
|
shift;;
|
|
|
|
-f*) f=`expr "$1" : '-f\(.*\)'`
|
|
program="$program$n@@include $f";;
|
|
|
|
-[W-]file=*)
|
|
f=`expr "$1" : '-.file=\(.*\)'`
|
|
program="$program$n@@include $f";;
|
|
|
|
-[W-]file)
|
|
program="$program$n@@include $@{2?'missing operand'@}"
|
|
shift;;
|
|
|
|
-[W-]source=*)
|
|
t=`expr "$1" : '-.source=\(.*\)'`
|
|
program="$program$n$t";;
|
|
|
|
-[W-]source)
|
|
program="$program$n$@{2?'missing operand'@}"
|
|
shift;;
|
|
|
|
-[W-]version)
|
|
echo igawk: version 2.0 1>&2
|
|
gawk --version
|
|
exit 0 ;;
|
|
|
|
-[W-]*) opts="$opts '$1'" ;;
|
|
|
|
*) break;;
|
|
esac
|
|
shift
|
|
done
|
|
|
|
if [ -z "$program" ]
|
|
then
|
|
program=$@{1?'missing program'@}
|
|
shift
|
|
fi
|
|
|
|
# At this point, `program' has the program.
|
|
@c endfile
|
|
@end example
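
The @samp{$@{2?'missing operand'@}} expansions used throughout the loop
can be explored with a small sketch. The function name
@code{need_operand} is hypothetical. The expansion is run in a subshell,
an arrangement chosen here only so that a failed expansion does not
terminate the calling shell.

@verbatim
```shell
# need_operand: hypothetical name. Succeeds if a second argument was
# supplied (even an empty one); fails if it is missing altogether.
need_operand() {
    # The subshell exits with a nonzero status if $2 is unset.
    ( : "${2?missing operand}" ) 2>/dev/null
}

need_operand -f prog.awk && echo "operand present"
need_operand -f ""       && echo "empty operand still counts as set"
need_operand -f          || echo "missing operand"
```
@end verbatim

Without the colon, the @samp{?} form tests only whether the parameter is
set; the @samp{:?} form would also reject an empty value.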
|
|
|
|
The @command{awk} program to process @samp{@@include} directives
|
|
is stored in the shell variable @code{expand_prog}. Doing this keeps
|
|
the shell script readable. The @command{awk} program
|
|
reads through the user's program, one line at a time, using @code{getline}
|
|
(@pxref{Getline}). The input
|
|
@value{FN}s and @samp{@@include} statements are managed using a stack.
|
|
As each @samp{@@include} is encountered, the current @value{FN} is
|
|
``pushed'' onto the stack and the file named in the @samp{@@include}
|
|
directive becomes the current @value{FN}. As each file is finished,
|
|
the stack is ``popped,'' and the previous input file becomes the current
|
|
input file again. The process is started by making the original file
|
|
the first one on the stack.
|
|
|
|
The @code{pathto} function does the work of finding the full path to
|
|
a file. It simulates @command{gawk}'s behavior when searching the
|
|
@env{AWKPATH} environment variable
|
|
(@pxref{AWKPATH Variable}).
|
|
If a @value{FN} has a @samp{/} in it, no path search is done. Otherwise,
|
|
the @value{FN} is concatenated with the name of each directory in
|
|
the path, and an attempt is made to open the generated @value{FN}.
|
|
The only way to test if a file can be read in @command{awk} is to go
|
|
ahead and try to read it with @code{getline}; this is what @code{pathto}
|
|
does.@footnote{On some very old versions of @command{awk}, the test
|
|
@samp{getline junk < t} can loop forever if the file exists but is empty.
|
|
Caveat emptor.} If the file can be read, it is closed and the @value{FN}
|
|
is returned:
|
|
|
|
@ignore
|
|
An alternative way to test for the file's existence would be to call
|
|
@samp{system("test -r " t)}, which uses the @command{test} utility to
|
|
see if the file exists and is readable. The disadvantage to this method
|
|
is that it requires creating an extra process and can thus be slightly
|
|
slower.
|
|
@end ignore
|
|
|
|
@example
|
|
@c file eg/prog/igawk.sh
|
|
expand_prog='
|
|
|
|
function pathto(file, i, t, junk)
|
|
@{
|
|
if (index(file, "/") != 0)
|
|
return file
|
|
|
|
for (i = 1; i <= ndirs; i++) @{
|
|
t = (pathlist[i] "/" file)
|
|
@group
|
|
if ((getline junk < t) > 0) @{
|
|
# found it
|
|
close(t)
|
|
return t
|
|
@}
|
|
@end group
|
|
@}
|
|
return ""
|
|
@}
|
|
@c endfile
|
|
@end example
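
The readability test at the heart of @code{pathto} can be sketched
separately. The function name @code{is_readable} is made up for this
illustration; it reports its result through the exit status.

@verbatim
```shell
# is_readable: hypothetical name. Exits 0 if at least one record can be
# read from the named file, and 1 otherwise.
is_readable() {
    awk -v t="$1" 'BEGIN {
        if ((getline junk < t) > 0) {   # 1 = got a record
            close(t)
            exit 0
        }
        exit 1                          # 0 = end of file, -1 = error
    }'
}
```
@end verbatim

Note one consequence of testing for a result greater than zero: a file
that exists but is empty yields end of file immediately, so this test
treats it the same as a file that cannot be found.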
|
|
|
|
The main program is contained inside one @code{BEGIN} rule. The first thing it
|
|
does is set up the @code{pathlist} array that @code{pathto} uses. After
|
|
splitting the path on @samp{:}, null elements are replaced with @code{"."},
|
|
which represents the current directory:
|
|
|
|
@example
|
|
@c file eg/prog/igawk.sh
|
|
BEGIN @{
|
|
path = ENVIRON["AWKPATH"]
|
|
ndirs = split(path, pathlist, ":")
|
|
for (i = 1; i <= ndirs; i++) @{
|
|
if (pathlist[i] == "")
|
|
pathlist[i] = "."
|
|
@}
|
|
@c endfile
|
|
@end example
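
To see what this setup produces for a given search path, the splitting
step can be run by itself. This is a sketch only: the function name
@code{split_awkpath} and the sample directories are made up, and the path
is passed as an argument instead of being read from @code{ENVIRON}.

@verbatim
```shell
# split_awkpath: hypothetical name. Prints one directory per line for the
# colon-separated path given as its argument, with null elements shown
# as "." (the current directory).
split_awkpath() {
    awk -v path="$1" 'BEGIN {
        ndirs = split(path, pathlist, ":")
        for (i = 1; i <= ndirs; i++) {
            if (pathlist[i] == "")
                pathlist[i] = "."
            print pathlist[i]
        }
    }'
}

split_awkpath ":/usr/local/share/awk:/usr/share/awk"
# prints:
# .
# /usr/local/share/awk
# /usr/share/awk
```
@end verbatim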
|
|
|
|
The stack is initialized with @code{ARGV[1]}, which will be @file{/dev/stdin}.
|
|
The main loop comes next. Input lines are read in succession. Lines that
|
|
do not start with @samp{@@include} are printed verbatim.
|
|
If the line does start with @samp{@@include}, the @value{FN} is in @code{$2}.
|
|
@code{pathto} is called to generate the full path. If it cannot, then we
|
|
print an error message and continue.
|
|
|
|
The next thing to check is whether the file has already been included. The
|
|
@code{processed} array is indexed by the full @value{FN} of each included
|
|
file and it tracks this information for us. If the file is
|
|
seen again, a warning message is printed. Otherwise, the new @value{FN} is
|
|
pushed onto the stack and processing continues.
|
|
|
|
Finally, when @code{getline} encounters the end of the input file, the file
|
|
is closed and the stack is popped. When @code{stackptr} is less than zero,
|
|
the program is done:
|
|
|
|
@example
|
|
@c file eg/prog/igawk.sh
|
|
stackptr = 0
|
|
input[stackptr] = ARGV[1] # ARGV[1] is first file
|
|
|
|
for (; stackptr >= 0; stackptr--) @{
|
|
while ((getline < input[stackptr]) > 0) @{
|
|
if (tolower($1) != "@@include") @{
|
|
print
|
|
continue
|
|
@}
|
|
fpath = pathto($2)
|
|
@group
|
|
if (fpath == "") @{
|
|
printf("igawk:%s:%d: cannot find %s\n",
|
|
input[stackptr], FNR, $2) > "/dev/stderr"
|
|
continue
|
|
@}
|
|
@end group
|
|
if (! (fpath in processed)) @{
|
|
processed[fpath] = input[stackptr]
|
|
input[++stackptr] = fpath # push onto stack
|
|
@} else
|
|
print $2, "included in", input[stackptr],
|
|
"already included in",
|
|
processed[fpath] > "/dev/stderr"
|
|
@}
|
|
close(input[stackptr])
|
|
@}
|
|
@}' # close quote ends `expand_prog' variable
|
|
|
|
processed_program=`gawk -- "$expand_prog" /dev/stdin <<EOF
|
|
$program
|
|
EOF
|
|
`
|
|
@c endfile
|
|
@end example
|
|
|
|
The shell construct @samp{@var{command} << @var{marker}} is called a @dfn{here document}.
|
|
Everything in the shell script up to the @var{marker} is fed to @var{command} as input.
|
|
The shell processes the contents of the here document for variable and command substitution
|
|
(and possibly other things as well, depending upon the shell).
|
|
|
|
The shell construct @samp{`@dots{}`} is called @dfn{command substitution}.
|
|
The output of the command between the two backquotes (grave accents) is substituted
|
|
into the command line. It is saved as a single string, even if the results
|
|
contain whitespace.
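
The two constructs can be combined in a few lines. In this sketch,
@command{cat} stands in for the @command{gawk} invocation, and the names
@code{capture_text}, @code{text}, and @code{result} are made up for
illustration.

@verbatim
```shell
# capture_text: hypothetical name. Feeds its argument to a command through
# a here document and captures that command's output in one variable.
capture_text() {
    text=$1
    result=`cat <<EOF
$text
EOF
`
    printf '%s\n' "$result"
}

capture_text 'BEGIN { print "two  spaces" }'
# prints: BEGIN { print "two  spaces" }
```
@end verbatim

The shell expands @samp{$text} inside the here document, and the
assignment keeps the captured output as one string, embedded whitespace
included.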
|
|
|
|
The expanded program is saved in the variable @code{processed_program}.
|
|
It's done in these steps:
|
|
|
|
@enumerate
|
|
@item
|
|
Run @command{gawk} with the @samp{@@include}-processing program (the
|
|
value of the @code{expand_prog} shell variable) on standard input.
|
|
|
|
@item
|
|
Standard input is the contents of the user's program, from the shell variable @code{program}.
|
|
Its contents are fed to @command{gawk} via a here document.
|
|
|
|
@item
|
|
The results of this processing are saved in the shell variable @code{processed_program} by using command substitution.
|
|
@end enumerate
|
|
|
|
The last step is to call @command{gawk} with the expanded program,
|
|
along with the original
|
|
options and command-line arguments that the user supplied.
|
|
|
|
@c this causes more problems than it solves, so leave it out.
|
|
@ignore
|
|
The special file @file{/dev/null} is passed as a @value{DF} to @command{gawk}
|
|
to handle an interesting case. Suppose that the user's program only has
|
|
a @code{BEGIN} rule and there are no @value{DF}s to read.
|
|
The program should exit without reading any @value{DF}s.
|
|
However, suppose that an included library file defines an @code{END}
|
|
rule of its own. In this case, @command{gawk} will hang, reading standard
|
|
input. In order to avoid this, @file{/dev/null} is explicitly added to the
|
|
command-line. Reading from @file{/dev/null} always returns an immediate
|
|
end of file indication.
|
|
|
|
@c Hmm. Add /dev/null if $# is 0? Still messes up ARGV. Sigh.
|
|
@end ignore
|
|
|
|
@example
|
|
@c file eg/prog/igawk.sh
|
|
eval gawk $opts -- '"$processed_program"' '"$@@"'
|
|
@c endfile
|
|
@end example
|
|
|
|
The @command{eval} command is a shell construct that reruns the shell's parsing
|
|
process. This keeps things properly quoted.
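
A small experiment shows why the second parsing pass matters. This is a
sketch under assumptions: @code{show_args} is a made-up helper that prints
each of its arguments on a separate line, and the value of @code{opts}
imitates what the argument loop would build for @samp{-v 'msg=hello world'}.

@verbatim
```shell
# show_args: hypothetical helper. Prints each argument in angle brackets.
show_args() {
    for arg in "$@"; do
        printf '<%s>\n' "$arg"
    done
}

opts="-v 'msg=hello world'"

show_args $opts         # plain expansion: the quotes are ordinary characters
# prints:
# <-v>
# <'msg=hello>
# <world'>

eval show_args $opts    # eval reparses, so the quotes group the words
# prints:
# <-v>
# <msg=hello world>
```
@end verbatim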
|
|
|
|
This version of @command{igawk} represents my fourth attempt at this program.
|
|
There are four key simplifications that make the program work better:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
Using @samp{@@include} even for the files named with @option{-f} makes building
|
|
the initial collected @command{awk} program much simpler; all the
|
|
@samp{@@include} processing can be done once.
|
|
|
|
@item
|
|
Not trying to save the line read with @code{getline}
|
|
in the @code{pathto} function when testing for the
|
|
file's accessibility for use with the main program simplifies things
|
|
considerably.
|
|
@c what problem does this engender though - exercise
|
|
@c answer, reading from "-" or /dev/stdin
|
|
|
|
@item
|
|
Using a @code{getline} loop in the @code{BEGIN} rule does it all in one
|
|
place. It is not necessary to call out to a separate loop for processing
|
|
nested @samp{@@include} statements.
|
|
|
|
@item
|
|
Instead of saving the expanded program in a temporary file, putting it in a shell variable
|
|
avoids some potential security problems.
|
|
This has the disadvantage that the script relies upon more features
|
|
of the @command{sh} language, making it harder to follow for those who
|
|
aren't familiar with @command{sh}.
|
|
@end itemize
|
|
|
|
Also, this program illustrates that it is often worthwhile to combine
|
|
@command{sh} and @command{awk} programming. You can usually
|
|
accomplish quite a lot, without having to resort to low-level programming
|
|
in C or C++, and it is frequently easier to do certain kinds of string
|
|
and argument manipulation using the shell than it is in @command{awk}.
|
|
|
|
Finally, @command{igawk} shows that it is not always necessary to add new
|
|
features to a program; they can often be layered on top. With @command{igawk},
|
|
there is no real reason to build @samp{@@include} processing into
|
|
@command{gawk} itself.
|
|
|
|
@cindex search paths, for source files
|
|
@c comma is part of primary
|
|
@cindex source files, search path for
|
|
@c last comma is part of secondary
|
|
@cindex files, source, search path for
|
|
@cindex directories, searching
|
|
As an additional example of this, consider the idea of having two
|
|
files in a directory in the search path:
|
|
|
|
@table @file
|
|
@item default.awk
|
|
This file contains a set of default library functions, such
|
|
as @code{getopt} and @code{assert}.
|
|
|
|
@item site.awk
|
|
This file contains library functions that are specific to a site or
|
|
installation; i.e., locally developed functions.
|
|
Having a separate file allows @file{default.awk} to change with
|
|
new @command{gawk} releases, without requiring the system administrator to
|
|
update it each time by adding the local functions.
|
|
@end table
|
|
|
|
One user
|
|
@c Karl Berry, karl@ileaf.com, 10/95
|
|
suggested that @command{gawk} be modified to automatically read these files
|
|
upon startup. Instead, it would be very simple to modify @command{igawk}
|
|
to do this. Since @command{igawk} can process nested @samp{@@include}
|
|
directives, @file{default.awk} could simply contain @samp{@@include}
|
|
statements for the desired library functions.
|
|
@c ENDOFRANGE libfex
|
|
@c ENDOFRANGE flibex
|
|
@c ENDOFRANGE awkpex
|
|
|
|
@c Exercise: make this change
|
|
|
|
@ignore
|
|
@c Try this
|
|
@iftex
|
|
@page
|
|
@headings off
|
|
@majorheading III@ @ @ Appendixes
|
|
Part III provides the appendixes, the Glossary, and two licenses that cover
|
|
the @command{gawk} source code and this @value{DOCUMENT}, respectively.
|
|
It contains the following appendixes:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
@ref{Language History}.
|
|
|
|
@item
|
|
@ref{Installation}.
|
|
|
|
@item
|
|
@ref{Notes}.
|
|
|
|
@item
|
|
@ref{Basic Concepts}.
|
|
|
|
@item
|
|
@ref{Glossary}.
|
|
|
|
@item
|
|
@ref{Copying}.
|
|
|
|
@item
|
|
@ref{GNU Free Documentation License}.
|
|
@end itemize
|
|
|
|
@page
|
|
@evenheading @thispage@ @ @ @strong{@value{TITLE}} @| @|
|
|
@oddheading @| @| @strong{@thischapter}@ @ @ @thispage
|
|
@end iftex
|
|
@end ignore
|
|
|
|
@node Language History
|
|
@appendix The Evolution of the @command{awk} Language
|
|
|
|
This @value{DOCUMENT} describes the GNU implementation of @command{awk}, which follows
|
|
the POSIX specification.
|
|
Many long-time @command{awk} users learned @command{awk} programming
|
|
with the original @command{awk} implementation in Version 7 Unix.
|
|
(This implementation was the basis for @command{awk} in Berkeley Unix,
|
|
through 4.3-Reno. Subsequent versions of Berkeley Unix, and systems
|
|
derived from 4.4BSD-Lite, use various versions of @command{gawk}
|
|
for their @command{awk}.)
|
|
This @value{CHAPTER} briefly describes the
|
|
evolution of the @command{awk} language, with cross-references to other parts
|
|
of the @value{DOCUMENT} where you can find more information.
|
|
|
|
@menu
* V7/SVR3.1::                   The major changes between V7 and System V
                                Release 3.1.
* SVR4::                        Minor changes between System V Releases 3.1
                                and 4.
* POSIX::                       New features from the POSIX standard.
* BTL::                         New features from the Bell Laboratories
                                version of @command{awk}.
* POSIX/GNU::                   The extensions in @command{gawk} not in POSIX
                                @command{awk}.
* Contributors::                The major contributors to @command{gawk}.
@end menu
|
|
|
|
@node V7/SVR3.1
|
|
@appendixsec Major Changes Between V7 and SVR3.1
|
|
@c STARTOFRANGE gawkv
|
|
@cindex @command{awk}, versions of
|
|
@c STARTOFRANGE gawkv1
|
|
@cindex @command{awk}, versions of, changes between V7 and SVR3.1
|
|
|
|
The @command{awk} language evolved considerably between the release of
|
|
Version 7 Unix (1978) and the new version that was first made generally available in
|
|
System V Release 3.1 (1987). This @value{SECTION} summarizes the changes, with
|
|
cross-references to further details:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
The requirement for @samp{;} to separate rules on a line
|
|
(@pxref{Statements/Lines}).
|
|
|
|
@item
|
|
User-defined functions and the @code{return} statement
|
|
(@pxref{User-defined}).
|
|
|
|
@item
|
|
The @code{delete} statement (@pxref{Delete}).
|
|
|
|
@item
|
|
The @code{do}-@code{while} statement
|
|
(@pxref{Do Statement}).
|
|
|
|
@item
|
|
The built-in functions @code{atan2}, @code{cos}, @code{sin}, @code{rand}, and
|
|
@code{srand} (@pxref{Numeric Functions}).
|
|
|
|
@item
|
|
The built-in functions @code{gsub}, @code{sub}, and @code{match}
|
|
(@pxref{String Functions}).
|
|
|
|
@item
|
|
The built-in functions @code{close} and @code{system}
|
|
(@pxref{I/O Functions}).
|
|
|
|
@item
|
|
The @code{ARGC}, @code{ARGV}, @code{FNR}, @code{RLENGTH}, @code{RSTART},
|
|
and @code{SUBSEP} built-in variables (@pxref{Built-in Variables}).
|
|
|
|
@item
|
|
The conditional expression using the ternary operator @samp{?:}
|
|
(@pxref{Conditional Exp}).
|
|
|
|
@item
|
|
The exponentiation operator @samp{^}
|
|
(@pxref{Arithmetic Ops}) and its assignment operator
|
|
form @samp{^=} (@pxref{Assignment Ops}).
|
|
|
|
@item
|
|
C-compatible operator precedence, which breaks some old @command{awk}
|
|
programs (@pxref{Precedence}).
|
|
|
|
@item
|
|
Regexps as the value of @code{FS}
|
|
(@pxref{Field Separators}) and as the
|
|
third argument to the @code{split} function
|
|
(@pxref{String Functions}).
|
|
|
|
@item
|
|
Dynamic regexps as operands of the @samp{~} and @samp{!~} operators
|
|
(@pxref{Regexp Usage}).
|
|
|
|
@item
|
|
The escape sequences @samp{\b}, @samp{\f}, and @samp{\r}
|
|
(@pxref{Escape Sequences}).
|
|
(Some vendors have updated their old versions of @command{awk} to
|
|
recognize @samp{\b}, @samp{\f}, and @samp{\r}, but this is not
|
|
something you can rely on.)
|
|
|
|
@item
|
|
Redirection of input for the @code{getline} function
|
|
(@pxref{Getline}).
|
|
|
|
@item
|
|
Multiple @code{BEGIN} and @code{END} rules
|
|
(@pxref{BEGIN/END}).
|
|
|
|
@item
|
|
Multidimensional arrays
|
|
(@pxref{Multi-dimensional}).
|
|
@end itemize
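
The short program below is a minimal sketch that exercises several of the
features listed above in one place. It assumes a ``new'' @command{awk}
(SVR3.1 or later); the function name @code{cube} and the array name
@code{seen} are made up for the illustration:

@example
# Assumes a post-V7 awk; cube() and seen[] are hypothetical names.
function cube(x)
@{
    return x ^ 3                        # exponentiation operator
@}

BEGIN @{
    i = 1
    do @{                                # do-while statement
        seen[i] = cube(i)
        i++
    @} while (i <= 3)

    delete seen[2]                      # delete statement
    print ((2 in seen) ? "present" : "deleted")    # ?: operator

    if (match("foobar", /o+/))          # match sets RSTART and RLENGTH
        print RSTART, RLENGTH
@}
@end example

@noindent
Run with a conforming @command{awk}, this should print @samp{deleted}
followed by @samp{2 2}.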
|
|
@c ENDOFRANGE gawkv1
|
|
|
|
@node SVR4
|
|
@appendixsec Changes Between SVR3.1 and SVR4
|
|
|
|
@cindex @command{awk}, versions of, changes between SVR3.1 and SVR4
|
|
The System V Release 4 (1989) version of Unix @command{awk} added these features
|
|
(some of which originated in @command{gawk}):
|
|
|
|
@itemize @bullet
|
|
@item
|
|
The @code{ENVIRON} variable (@pxref{Built-in Variables}).
|
|
@c gawk and MKS awk
|
|
|
|
@item
|
|
Multiple @option{-f} options on the command line
|
|
(@pxref{Options}).
|
|
@c MKS awk
|
|
|
|
@item
|
|
The @option{-v} option for assigning variables before program execution begins
|
|
(@pxref{Options}).
|
|
@c GNU, Bell Laboratories & MKS together
|
|
|
|
@item
|
|
The @option{--} option for terminating command-line options.
|
|
|
|
@item
|
|
The @samp{\a}, @samp{\v}, and @samp{\x} escape sequences
|
|
(@pxref{Escape Sequences}).
|
|
@c GNU, for ANSI C compat
|
|
|
|
@item
|
|
A defined return value for the @code{srand} built-in function
|
|
(@pxref{Numeric Functions}).
|
|
|
|
@item
|
|
The @code{toupper} and @code{tolower} built-in string functions
|
|
for case translation
|
|
(@pxref{String Functions}).
|
|
|
|
@item
|
|
A cleaner specification for the @samp{%c} format-control letter in the
|
|
@code{printf} function
|
|
(@pxref{Control Letters}).
|
|
|
|
@item
|
|
The ability to dynamically pass the field width and precision (@code{"%*.*d"})
|
|
in the argument list of the @code{printf} function
|
|
(@pxref{Control Letters}).
|
|
|
|
@item
|
|
The use of regexp constants, such as @code{/foo/}, as expressions, where
|
|
they are equivalent to using the matching operator, as in @samp{$0 ~ /foo/}
|
|
(@pxref{Using Constant Regexps}).
|
|
|
|
@item
|
|
Processing of escape sequences inside command-line variable assignments
|
|
(@pxref{Assignment Options}).
|
|
@end itemize
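
As an illustration only, the following command line combines three of
these additions: the @option{-v} option, dynamic field width and
precision in @code{printf}, and the case-translation functions. The
variable names @code{width} and @code{prec} are arbitrary choices for
this sketch:

@example
awk -v width=8 -v prec=2 'BEGIN @{
    # the field width and precision come from the argument list
    printf "%*.*f\n", width, prec, 3.14159
    print toupper("awk"), tolower("GAWK")
@}'
@end example

@noindent
The first line of output is @samp{3.14} right-justified in a field of
eight characters; the second is @samp{AWK gawk}.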
|
|
|
|
@node POSIX
|
|
@appendixsec Changes Between SVR4 and POSIX @command{awk}
|
|
@cindex @command{awk}, versions of, changes between SVR4 and POSIX @command{awk}
|
|
@cindex POSIX @command{awk}, changes in @command{awk} versions
|
|
|
|
The POSIX Command Language and Utilities standard for @command{awk} (1992)
|
|
introduced the following changes into the language:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
The use of @option{-W} for implementation-specific options
|
|
(@pxref{Options}).
|
|
|
|
@item
|
|
The use of @code{CONVFMT} for controlling the conversion of numbers
|
|
to strings (@pxref{Conversion}).
|
|
|
|
@item
|
|
The concept of a numeric string and tighter comparison rules to go
|
|
with it (@pxref{Typing and Comparison}).
|
|
|
|
@item
|
|
More complete documentation of many of the previously undocumented
|
|
features of the language.
|
|
@end itemize
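
The effect of @code{CONVFMT} can be seen in a small sketch such as the
following, which assumes the default value of @code{OFMT}
(@code{"%.6g"}); the variable names are arbitrary:

@example
BEGIN @{
    CONVFMT = "%.2f"
    x = 3.14159
    s = x ""          # number-to-string conversion uses CONVFMT
    print s           # prints 3.14
    print x           # print uses OFMT instead, so this prints 3.14159
@}
@end example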
|
|
|
|
The following common extensions are not permitted by the POSIX
standard, so a strictly conforming @command{awk} behaves as follows:
|
|
|
|
@c IMPORTANT! Keep this list in sync with the one in node Options
|
|
|
|
@itemize @bullet
|
|
@item
|
|
@code{\x} escape sequences are not recognized
|
|
(@pxref{Escape Sequences}).
|
|
|
|
@item
|
|
Newlines do not act as whitespace to separate fields when @code{FS} is
|
|
equal to a single space
|
|
(@pxref{Fields}).
|
|
|
|
@item
|
|
Newlines are not allowed after @samp{?} or @samp{:}
|
|
(@pxref{Conditional Exp}).
|
|
|
|
@item
|
|
The synonym @code{func} for the keyword @code{function} is not
|
|
recognized (@pxref{Definition Syntax}).
|
|
|
|
@item
|
|
The operators @samp{**} and @samp{**=} cannot be used in
|
|
place of @samp{^} and @samp{^=} (@pxref{Arithmetic Ops},
|
|
and @ref{Assignment Ops}).
|
|
|
|
@item
|
|
Specifying @samp{-Ft} on the command line does not set the value
|
|
of @code{FS} to be a single TAB character
|
|
(@pxref{Field Separators}).
|
|
|
|
@item
|
|
The @code{fflush} built-in function is not supported
|
|
(@pxref{I/O Functions}).
|
|
@end itemize
|
|
@c ENDOFRANGE gawkv
|
|
|
|
@node BTL
|
|
@appendixsec Extensions in the Bell Laboratories @command{awk}
|
|
|
|
@cindex @command{awk}, versions of, See Also Bell Laboratories @command{awk}
|
|
@cindex extensions, Bell Laboratories @command{awk}
|
|
@cindex Bell Laboratories @command{awk} extensions
|
|
@cindex Kernighan, Brian
|
|
Brian Kernighan, one of the original designers of Unix @command{awk},
|
|
has made his version available via his home page
|
|
(@pxref{Other Versions}).
|
|
This @value{SECTION} describes extensions in his version of @command{awk} that are
|
|
not in POSIX @command{awk}:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
The @samp{-mf @var{N}} and @samp{-mr @var{N}} command-line options
|
|
to set the maximum number of fields and the maximum
|
|
record size, respectively
|
|
(@pxref{Options}).
|
|
As a side note, his @command{awk} no longer needs these options;
|
|
it continues to accept them to avoid breaking old programs.
|
|
|
|
@item
|
|
The @code{fflush} built-in function for flushing buffered output
|
|
(@pxref{I/O Functions}).
|
|
|
|
@item
|
|
The @samp{**} and @samp{**=} operators
|
|
(@pxref{Arithmetic Ops}
|
|
and
|
|
@ref{Assignment Ops}).
|
|
|
|
@item
|
|
The use of @code{func} as an abbreviation for @code{function}
|
|
(@pxref{Definition Syntax}).
|
|
|
|
@ignore
|
|
@item
|
|
The @code{SYMTAB} array, that allows access to @command{awk}'s internal symbol
|
|
table. This feature is not documented, largely because
|
|
it is somewhat shakily implemented. For instance, you cannot access arrays
|
|
or array elements through it.
|
|
@end ignore
|
|
@end itemize
|
|
|
|
The Bell Laboratories @command{awk} also incorporates the following extensions,
|
|
originally developed for @command{gawk}:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
The @samp{\x} escape sequence
|
|
(@pxref{Escape Sequences}).
|
|
|
|
@item
|
|
The @file{/dev/stdin}, @file{/dev/stdout}, and @file{/dev/stderr}
|
|
special files
|
|
(@pxref{Special Files}).
|
|
|
|
@item
|
|
The ability for @code{FS} and for the third
|
|
argument to @code{split} to be null strings
|
|
(@pxref{Single Character Fields}).
|
|
|
|
@item
|
|
The @code{nextfile} statement
|
|
(@pxref{Nextfile Statement}).
|
|
|
|
@item
|
|
The ability to delete all of an array at once with @samp{delete @var{array}}
|
|
(@pxref{Delete}).
|
|
@end itemize
|
|
|
|
@node POSIX/GNU
|
|
@appendixsec Extensions in @command{gawk} Not in POSIX @command{awk}
|
|
|
|
@ignore
|
|
I've tried to follow this general order, esp. for the 3.0 and 3.1 sections:
|
|
variables
|
|
special files
|
|
language changes (e.g., hex constants)
|
|
differences in standard awk functions
|
|
new gawk functions
|
|
new keywords
|
|
new command-line options
|
|
new ports
|
|
Within each category, be alphabetical.
|
|
@end ignore
|
|
|
|
@c STARTOFRANGE fripls
|
|
@cindex compatibility mode (@command{gawk}), extensions
|
|
@c STARTOFRANGE exgnot
|
|
@cindex extensions, in @command{gawk}, not in POSIX @command{awk}
|
|
@c STARTOFRANGE posnot
|
|
@cindex POSIX, @command{gawk} extensions not included in
|
|
The GNU implementation, @command{gawk}, adds a large number of features.
|
|
This @value{SECTION} lists them in the order they were added to @command{gawk}.
|
|
They can all be disabled with either the @option{--traditional} or the
@option{--posix} option
|
|
(@pxref{Options}).
|
|
|
|
Version 2.10 of @command{gawk} introduced the following features:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
The @env{AWKPATH} environment variable for specifying a search path for
|
|
the @option{-f} command-line option
|
|
(@pxref{Options}).
|
|
|
|
@item
|
|
The @code{IGNORECASE} variable and its effects
|
|
(@pxref{Case-sensitivity}).
|
|
|
|
@item
|
|
The @file{/dev/stdin}, @file{/dev/stdout}, @file{/dev/stderr}, and
|
|
@file{/dev/fd/@var{N}} special @value{FN}s
|
|
(@pxref{Special Files}).
|
|
@end itemize
|
|
|
|
Version 2.13 of @command{gawk} introduced the following features:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
The @code{FIELDWIDTHS} variable and its effects
|
|
(@pxref{Constant Size}).
|
|
|
|
@item
|
|
The @code{systime} and @code{strftime} built-in functions for obtaining
|
|
and printing timestamps
|
|
(@pxref{Time Functions}).
|
|
|
|
@item
|
|
The @option{-W lint} option to provide error and portability checking
|
|
both for the source code and at runtime
|
|
(@pxref{Options}).
|
|
|
|
@item
|
|
The @option{-W compat} option to turn off the GNU extensions
|
|
(@pxref{Options}).
|
|
|
|
@item
|
|
The @option{-W posix} option for full POSIX compliance
|
|
(@pxref{Options}).
|
|
@end itemize
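
The two commands below are a minimal sketch of @code{FIELDWIDTHS} and
of the time functions. The sample input, the column widths, and the
date format are all arbitrary choices made for the illustration:

@example
# Split each record into fixed-width columns of 3, 2, and 4 characters
echo abcdefghi |
gawk 'BEGIN @{ FIELDWIDTHS = "3 2 4" @} @{ print $1 "," $2 "," $3 @}'

# Print the current date
gawk 'BEGIN @{ print strftime("%Y-%m-%d", systime()) @}'
@end example

@noindent
The first command prints @samp{abc,de,fghi}; the output of the second
depends on when it is run.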
|
|
|
|
Version 2.14 of @command{gawk} introduced the following feature:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
The @code{next file} statement for skipping to the next @value{DF}
|
|
(@pxref{Nextfile Statement}).
|
|
@end itemize
|
|
|
|
Version 2.15 of @command{gawk} introduced the following features:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
The @code{ARGIND} variable, which tracks the movement of @code{FILENAME}
|
|
through @code{ARGV} (@pxref{Built-in Variables}).
|
|
|
|
@item
|
|
The @code{ERRNO} variable, which contains the system error message when
|
|
@code{getline} returns @minus{}1 or @code{close} fails
|
|
(@pxref{Built-in Variables}).
|
|
|
|
@item
|
|
The @file{/dev/pid}, @file{/dev/ppid}, @file{/dev/pgrpid}, and
|
|
@file{/dev/user} @value{FN} interpretation
|
|
(@pxref{Special Files}).
|
|
|
|
@item
|
|
The ability to delete all of an array at once with @samp{delete @var{array}}
|
|
(@pxref{Delete}).
|
|
|
|
@item
|
|
The ability to use GNU-style long-named options that start with @option{--}
|
|
(@pxref{Options}).
|
|
|
|
@item
|
|
The @option{--source} option for mixing command-line and library-file
|
|
source code
|
|
(@pxref{Options}).
|
|
@end itemize
|
|
|
|
Version 3.0 of @command{gawk} introduced the following features:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
@code{IGNORECASE} changed, now applying to string comparison as well
|
|
as regexp operations
|
|
(@pxref{Case-sensitivity}).
|
|
|
|
@item
|
|
The @code{RT} variable that contains the input text that
|
|
matched @code{RS}
|
|
(@pxref{Records}).
|
|
|
|
@item
|
|
Full support for both POSIX and GNU regexps
|
|
(@pxref{Regexp}).
|
|
|
|
@item
|
|
The @code{gensub} function for more powerful text manipulation
|
|
(@pxref{String Functions}).
|
|
|
|
@item
|
|
The @code{strftime} function acquired a default time format,
|
|
allowing it to be called with no arguments
|
|
(@pxref{Time Functions}).
|
|
|
|
@item
|
|
The ability for @code{FS} and for the third
|
|
argument to @code{split} to be null strings
|
|
(@pxref{Single Character Fields}).
|
|
|
|
@item
|
|
The ability for @code{RS} to be a regexp
|
|
(@pxref{Records}).
|
|
|
|
@item
|
|
The @code{next file} statement became @code{nextfile}
|
|
(@pxref{Nextfile Statement}).
|
|
|
|
@item
|
|
The @option{--lint-old} option to
|
|
warn about constructs that are not available in
|
|
the original Version 7 Unix version of @command{awk}
|
|
(@pxref{V7/SVR3.1}).
|
|
|
|
@item
|
|
The @option{-m} option and the @code{fflush} function from the
|
|
Bell Laboratories research version of @command{awk}
|
|
(@pxref{Options}; also
|
|
@pxref{I/O Functions}).
|
|
|
|
@item
|
|
The @option{--re-interval} option to provide interval expressions in regexps
|
|
(@pxref{Regexp Operators}).
|
|
|
|
@item
|
|
The @option{--traditional} option was added as a better name for
|
|
@option{--compat} (@pxref{Options}).
|
|
|
|
@item
|
|
The use of GNU Autoconf to control the configuration process
|
|
(@pxref{Quick Installation}).
|
|
|
|
@item
|
|
Amiga support
|
|
(@pxref{Amiga Installation}).
|
|
|
|
@end itemize
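
As a hedged illustration of two of these features, the commands below
use @code{gensub} with backreferences and a regexp value for @code{RS}
together with @code{RT}. The sample strings are invented for this
sketch:

@example
# Swap two words; \\2 and \\1 name the parenthesized subexpressions
gawk 'BEGIN @{ print gensub(/(.+) (.+)/, "\\2 \\1", "g", "hello world") @}'

# Records are separated by runs of digits; RT holds the text that matched
printf 'a1b22c' | gawk 'BEGIN @{ RS = "[0-9]+" @} @{ print $0 ": " RT @}'
@end example

@noindent
The first command prints @samp{world hello}. The second prints one line
per record, showing the record and then the separator text that ended it;
@code{RT} is empty for the last record because no separator follows it.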
|
|
|
|
Version 3.1 of @command{gawk} introduced the following features:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
The @code{BINMODE} special variable for non-POSIX systems,
|
|
which allows binary I/O for input and/or output files
|
|
(@pxref{PC Using}).
|
|
|
|
@item
|
|
The @code{LINT} special variable, which dynamically controls lint warnings
|
|
(@pxref{Built-in Variables}).
|
|
|
|
@item
|
|
The @code{PROCINFO} array for providing process-related information
|
|
(@pxref{Built-in Variables}).
|
|
|
|
@item
|
|
The @code{TEXTDOMAIN} special variable for setting an application's
|
|
internationalization text domain
|
|
(@pxref{Built-in Variables},
|
|
and
|
|
@ref{Internationalization}).
|
|
|
|
@item
|
|
The ability to use octal and hexadecimal constants in @command{awk}
|
|
program source code
|
|
(@pxref{Nondecimal-numbers}).
|
|
|
|
@item
|
|
The @samp{|&} operator for two-way I/O to a coprocess
|
|
(@pxref{Two-way I/O}).
|
|
|
|
@item
|
|
The @file{/inet} special files for TCP/IP networking using @samp{|&}
|
|
(@pxref{TCP/IP Networking}).
|
|
|
|
@item
|
|
The optional second argument to @code{close} that allows closing one end
|
|
of a two-way pipe to a coprocess
|
|
(@pxref{Two-way I/O}).
|
|
|
|
@item
|
|
The optional third argument to the @code{match} function
|
|
for capturing text-matching subexpressions within a regexp
|
|
(@pxref{String Functions}).
|
|
|
|
@item
|
|
Positional specifiers in @code{printf} formats for
|
|
making translations easier
|
|
(@pxref{Printf Ordering}).
|
|
|
|
@item
|
|
The @code{asort} and @code{asorti} functions for sorting arrays
|
|
(@pxref{Array Sorting}).
|
|
|
|
@item
|
|
The @code{bindtextdomain}, @code{dcgettext}, and @code{dcngettext} functions
|
|
for internationalization
|
|
(@pxref{Programmer i18n}).
|
|
|
|
@item
|
|
The @code{extension} built-in function and the ability to add
|
|
new built-in functions dynamically
|
|
(@pxref{Dynamic Extensions}).
|
|
|
|
@item
|
|
The @code{mktime} built-in function for creating timestamps
|
|
(@pxref{Time Functions}).
|
|
|
|
@item
|
|
The
|
|
@code{and},
|
|
@code{or},
|
|
@code{xor},
|
|
@code{compl},
|
|
@code{lshift},
|
|
@code{rshift},
|
|
and
|
|
@code{strtonum} built-in
|
|
functions
|
|
(@pxref{Bitwise Functions}).
|
|
|
|
@item
|
|
@cindex @code{next file} statement
|
|
The support for @samp{next file} as two words was removed completely
|
|
(@pxref{Nextfile Statement}).
|
|
|
|
@item
|
|
The @option{--dump-variables} option to print a list of all global variables
|
|
(@pxref{Options}).
|
|
|
|
@item
|
|
The @option{--gen-po} command-line option and the use of a leading
|
|
underscore to mark strings that should be translated
|
|
(@pxref{String Extraction}).
|
|
|
|
@item
|
|
The @option{--non-decimal-data} option to allow non-decimal
|
|
input data
|
|
(@pxref{Nondecimal Data}).
|
|
|
|
@item
|
|
The @option{--profile} option and @command{pgawk}, the
|
|
profiling version of @command{gawk}, for producing execution
|
|
profiles of @command{awk} programs
|
|
(@pxref{Profiling}).
|
|
|
|
@item
|
|
The @option{--enable-portals} configuration option to enable special treatment of
|
|
pathnames that begin with @file{/p} as BSD portals
|
|
(@pxref{Portal Files}).
|
|
|
|
@item
|
|
The use of GNU Automake to help in standardizing the configuration process
|
|
(@pxref{Quick Installation}).
|
|
|
|
@item
|
|
The use of GNU @code{gettext} for @command{gawk}'s own message output
|
|
(@pxref{Gawk I18N}).
|
|
|
|
@item
|
|
BeOS support
|
|
(@pxref{BeOS Installation}).
|
|
|
|
@item
|
|
Tandem support
|
|
(@pxref{Tandem Installation}).
|
|
|
|
@item
|
|
The Atari port became officially unsupported
|
|
(@pxref{Atari Installation}).
|
|
|
|
@item
|
|
The source code now uses new-style function definitions, with
|
|
@command{ansi2knr} to convert the code on systems with old compilers.
|
|
|
|
@item
|
|
The @option{--disable-lint} configuration option to disable lint checking
|
|
at compile time
|
|
(@pxref{Additional Configuration Options}).
|
|
|
|
@end itemize
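
The following program is a sketch, not a definitive reference, showing a
few of the version 3.1 language additions together. It assumes
@command{gawk} 3.1 or later and a @command{sort} utility on the search
path; the array and variable names are hypothetical:

@example
BEGIN @{
    # the third argument to match() receives the subexpression matches
    if (match("gawk-3.1.3", /([0-9]+)\.([0-9]+)/, part))
        print part[1], part[2]          # prints "3 1"

    # asort() sorts by value and returns the number of elements
    data["x"] = "pear"; data["y"] = "apple"
    n = asort(data)
    for (i = 1; i <= n; i++)
        print data[i]                   # apple, then pear

    print strtonum("0x1F")              # prints 31
    print and(12, 10)                   # prints 8

    # two-way I/O with a coprocess
    cmd = "sort"
    print "b" |& cmd
    print "a" |& cmd
    close(cmd, "to")                    # signal end-of-file to sort
    while ((cmd |& getline line) > 0)
        print line
    close(cmd)
@}
@end example

@noindent
Closing only the @code{"to"} end of the pipe lets @command{sort} see
end-of-file and produce its output while the program can still read
from it.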
|
|
|
|
@c XXX ADD MORE STUFF HERE
|
|
|
|
@c ENDOFRANGE fripls
|
|
@c ENDOFRANGE exgnot
|
|
@c ENDOFRANGE posnot
|
|
|
|
@node Contributors
|
|
@appendixsec Major Contributors to @command{gawk}
|
|
@cindex @command{gawk}, list of contributors to
|
|
@quotation
|
|
@i{Always give credit where credit is due.}@*
|
|
Anonymous
|
|
@end quotation
|
|
|
|
This @value{SECTION} names the major contributors to @command{gawk}
|
|
and/or this @value{DOCUMENT}, in approximate chronological order:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
@cindex Aho, Alfred
|
|
@cindex Weinberger, Peter
|
|
@cindex Kernighan, Brian
|
|
Dr.@: Alfred V.@: Aho,
|
|
Dr.@: Peter J.@: Weinberger, and
|
|
Dr.@: Brian W.@: Kernighan, all of Bell Laboratories,
|
|
designed and implemented Unix @command{awk},
|
|
from which @command{gawk} gets the majority of its feature set.
|
|
|
|
@item
|
|
@cindex Rubin, Paul
|
|
Paul Rubin
|
|
did the initial design and implementation in 1986, and wrote
|
|
the first draft (around 40 pages) of this @value{DOCUMENT}.
|
|
|
|
@item
|
|
@cindex Fenlason, Jay
|
|
Jay Fenlason
|
|
finished the initial implementation.
|
|
|
|
@item
|
|
@cindex Close, Diane
|
|
Diane Close
|
|
revised the first draft of this @value{DOCUMENT}, bringing it
|
|
to around 90 pages.
|
|
|
|
@item
|
|
@cindex Stallman, Richard
|
|
Richard Stallman
|
|
helped finish the implementation and the initial draft of this
|
|
@value{DOCUMENT}.
|
|
He is also the founder of the FSF and the GNU project.
|
|
|
|
@item
|
|
@cindex Woods, John
|
|
John Woods
|
|
contributed parts of the code (mostly fixes) in
|
|
the initial version of @command{gawk}.
|
|
|
|
@item
|
|
@cindex Trueman, David
|
|
In 1988,
|
|
David Trueman
|
|
took over primary maintenance of @command{gawk},
|
|
making it compatible with ``new'' @command{awk}, and
|
|
greatly improving its performance.
|
|
|
|
@item
|
|
@cindex Rankin, Pat
|
|
Pat Rankin
|
|
provided the VMS port and its documentation.
|
|
|
|
@item
|
|
@cindex Kwok, Conrad
|
|
@cindex Garfinkle, Scott
|
|
@cindex Williams, Kent
|
|
Conrad Kwok,
|
|
Scott Garfinkle,
|
|
and
|
|
Kent Williams
|
|
did the initial ports to MS-DOS with various versions of MSC.
|
|
|
|
@item
|
|
@cindex Peterson, Hal
|
|
Hal Peterson
|
|
provided help in porting @command{gawk} to Cray systems.
|
|
|
|
@item
|
|
@cindex Rommel, Kai Uwe
|
|
Kai Uwe Rommel
|
|
provided the initial port to OS/2 and its documentation.
|
|
|
|
@item
|
|
@cindex Jaegermann, Michal
|
|
Michal Jaegermann
|
|
provided the port to Atari systems and its documentation.
|
|
He continues to provide portability checking with DEC Alpha
|
|
systems, and has done a lot of work to make sure @command{gawk}
|
|
works on non-32-bit systems.
|
|
|
|
@item
|
|
@cindex Fish, Fred
|
|
Fred Fish
|
|
provided the port to Amiga systems and its documentation.
|
|
|
|
@item
|
|
@cindex Deifik, Scott
|
|
Scott Deifik
|
|
currently maintains the MS-DOS port.
|
|
|
|
@item
|
|
@cindex Grigera, Juan
|
|
Juan Grigera
|
|
maintains the port to Windows32 systems.
|
|
|
|
@item
|
|
@cindex Hankerson, Darrel
|
|
Dr.@: Darrel Hankerson
|
|
acts as coordinator for the various ports to different PC platforms
|
|
and creates binary distributions for various PC operating systems.
|
|
He is also instrumental in keeping the documentation up to date for
|
|
the various PC platforms.
|
|
|
|
@item
|
|
@cindex Zoulas, Christos
|
|
Christos Zoulas
|
|
provided the @code{extension}
|
|
built-in function for dynamically adding new modules.
|
|
|
|
@item
|
|
@cindex Kahrs, J@"urgen
|
|
J@"urgen Kahrs
|
|
contributed the initial version of the TCP/IP networking
|
|
code and documentation, and motivated the inclusion of the @samp{|&} operator.
|
|
|
|
@item
|
|
@cindex Davies, Stephen
|
|
Stephen Davies
|
|
provided the port to Tandem systems and its documentation.
|
|
|
|
@item
|
|
@cindex Brown, Martin
|
|
Martin Brown
|
|
provided the port to BeOS and its documentation.
|
|
|
|
@item
|
|
@cindex Peters, Arno
|
|
Arno Peters
|
|
did the initial work to convert @command{gawk} to use
|
|
GNU Automake and @code{gettext}.
|
|
|
|
@item
|
|
@cindex Broder, Alan J.@:
|
|
Alan J.@: Broder
|
|
provided the initial version of the @code{asort} function
|
|
as well as the code for the new optional third argument to the @code{match} function.
|
|
|
|
@item
|
|
@cindex Buening, Andreas
|
|
Andreas Buening
|
|
updated the @command{gawk} port for OS/2.
|
|
|
|
@item
@cindex Hasegawa, Isamu
|
|
Isamu Hasegawa,
|
|
of IBM in Japan, contributed support for multibyte characters.
|
|
|
|
@item
@cindex Benzinger, Michael
|
|
Michael Benzinger contributed the initial code for @code{switch} statements.
|
|
|
|
@item
@cindex McPhee, Patrick
|
|
Patrick T.J.@: McPhee contributed the code for dynamic loading in Windows32
|
|
environments.
|
|
|
|
@item
|
|
@cindex Robbins, Arnold
|
|
Arnold Robbins
|
|
has been working on @command{gawk} since 1988, at first
|
|
helping David Trueman, and as the primary maintainer since around 1994.
|
|
@end itemize
|
|
|
|
@node Installation
|
|
@appendix Installing @command{gawk}
|
|
|
|
@c last two commas are part of see also
|
|
@cindex operating systems, See Also GNU/Linux, PC operating systems, Unix
|
|
@c STARTOFRANGE gligawk
|
|
@cindex @command{gawk}, installing
|
|
@c STARTOFRANGE ingawk
|
|
@cindex installing @command{gawk}
|
|
This appendix provides instructions for installing @command{gawk} on the
|
|
various platforms that are supported by the developers. The primary
|
|
developer supports GNU/Linux (and Unix), whereas the other ports are
|
|
contributed.
|
|
@xref{Bugs},
|
|
for the electronic mail addresses of the people who did
|
|
the respective ports.
|
|
|
|
@menu
* Gawk Distribution::           What is in the @command{gawk} distribution.
* Unix Installation::           Installing @command{gawk} under various
                                versions of Unix.
* Non-Unix Installation::       Installation on Other Operating Systems.
* Unsupported::                 Systems whose ports are no longer supported.
* Bugs::                        Reporting Problems and Bugs.
* Other Versions::              Other freely available @command{awk}
                                implementations.
@end menu
|
|
|
|
@node Gawk Distribution
|
|
@appendixsec The @command{gawk} Distribution
|
|
@cindex source code, @command{gawk}
|
|
|
|
This @value{SECTION} describes how to get the @command{gawk}
|
|
distribution, how to extract it, and then what is in the various files and
|
|
subdirectories.
|
|
|
|
@menu
* Getting::                     How to get the distribution.
* Extracting::                  How to extract the distribution.
* Distribution contents::       What is in the distribution.
@end menu
|
|
|
|
@node Getting
|
|
@appendixsubsec Getting the @command{gawk} Distribution
|
|
@c last comma is part of secondary
|
|
@cindex @command{gawk}, source code, obtaining
|
|
There are three ways to get GNU software:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
Copy it from someone else who already has it.
|
|
|
|
@cindex FSF (Free Software Foundation)
|
|
@cindex Free Software Foundation (FSF)
|
|
@item
|
|
Order @command{gawk} directly from the Free Software Foundation.
|
|
Software distributions are available for
|
|
GNU/Linux, Unix, and MS-Windows, in several CD packages.
|
|
Their address is:
|
|
|
|
@display
Free Software Foundation
59 Temple Place, Suite 330
Boston, MA 02111-1307 USA
Phone: +1-617-542-5942
Fax (including Japan): +1-617-542-2652
Email: @email{gnu@@gnu.org}
URL: @uref{http://www.gnu.org}
@end display
|
|
|
|
@noindent
|
|
Ordering from the FSF directly contributes to the support of the foundation
|
|
and to the production of more free software.
|
|
|
|
@item
|
|
Retrieve @command{gawk} by using anonymous @command{ftp} to the Internet host
|
|
@code{ftp.gnu.org}, in the directory @file{/gnu/gawk}.
|
|
@end itemize
|
|
|
|
The GNU software archive is mirrored around the world.
|
|
The up-to-date list of mirror sites is available from
|
|
@uref{http://www.gnu.org/order/ftp.html, the main FSF web site}.
|
|
Try to use one of the mirrors; they
|
|
will be less busy, and you can usually find one closer to your site.
|
|
|
|
@node Extracting
|
|
@appendixsubsec Extracting the Distribution
|
|
@command{gawk} is distributed as a @code{tar} file compressed with the
|
|
GNU Zip program, @code{gzip}.
|
|
|
|
Once you have the distribution (for example,
|
|
@file{gawk-@value{VERSION}.@value{PATCHLEVEL}.tar.gz}),
|
|
use @code{gzip} to expand the
|
|
file and then use @code{tar} to extract it. You can use the following
|
|
pipeline to unpack the @command{gawk} distribution:
|
|
|
|
@example
# Under System V, add 'o' to the tar options
gzip -d -c gawk-@value{VERSION}.@value{PATCHLEVEL}.tar.gz | tar -xvpf -
@end example
|
|
|
|
@noindent
|
|
This creates a directory named @file{gawk-@value{VERSION}.@value{PATCHLEVEL}}
|
|
in the current directory.
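
If you would rather inspect the archive before unpacking it, a variation
along the following lines lists the contents without extracting anything.
This is only a sketch; it assumes a @command{tar} that accepts the usual
@option{-t} (list) option:

@example
# List the files in the archive without extracting them
gzip -d -c gawk-@value{VERSION}.@value{PATCHLEVEL}.tar.gz | tar -tvf -
@end example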
|
|
|
|
The distribution @value{FN} is of the form
|
|
@file{gawk-@var{V}.@var{R}.@var{P}.tar.gz}.
|
|
The @var{V} represents the major version of @command{gawk},
|
|
the @var{R} represents the current release of version @var{V}, and
|
|
the @var{P} represents a @dfn{patch level}, meaning that minor bugs have
|
|
been fixed in the release. The current patch level is @value{PATCHLEVEL},
|
|
but when retrieving distributions, you should get the version with the highest
|
|
version, release, and patch level. (Note, however, that patch levels greater than
|
|
or equal to 80 denote ``beta'' or nonproduction software; you might not want
|
|
to retrieve such a version unless you don't mind experimenting.)
|
|
If you are not on a Unix system, you need to make other arrangements
|
|
for getting and extracting the @command{gawk} distribution. You should consult
|
|
a local expert.
|
|
|
|
@node Distribution contents
|
|
@appendixsubsec Contents of the @command{gawk} Distribution
|
|
@c STARTOFRANGE gawdis
|
|
@cindex @command{gawk}, distribution
|
|
|
|
The @command{gawk} distribution has a number of C source files,
|
|
documentation files,
|
|
subdirectories, and files related to the configuration process
|
|
(@pxref{Unix Installation}),
|
|
as well as several subdirectories related to different non-Unix
|
|
operating systems:
|
|
|
|
@table @asis
|
|
@item Various @samp{.c}, @samp{.y}, and @samp{.h} files
|
|
The actual @command{gawk} source code.
|
|
@end table
|
|
|
|
@table @file
|
|
@item README
|
|
@itemx README_d/README.*
|
|
Descriptive files: @file{README} for @command{gawk} under Unix and the
|
|
rest for the various hardware and software combinations.
|
|
|
|
@item INSTALL
|
|
A file providing an overview of the configuration and installation process.
|
|
|
|
@item ChangeLog
|
|
A detailed list of source code changes as bugs are fixed or improvements made.
|
|
|
|
@item NEWS
|
|
A list of changes to @command{gawk} since the last release or patch.
|
|
|
|
@item COPYING
|
|
The GNU General Public License.
|
|
|
|
@item FUTURES
|
|
A brief list of features and changes being contemplated for future
|
|
releases, with some indication of the time frame for the feature, based
|
|
on its difficulty.
|
|
|
|
@item LIMITATIONS
|
|
A list of those factors that limit @command{gawk}'s performance.
|
|
Most of these depend on the hardware or operating system software and
|
|
are not limits in @command{gawk} itself.
|
|
|
|
@item POSIX.STD
|
|
A description of one area in which the POSIX standard for @command{awk} is
|
|
incorrect as well as how @command{gawk} handles the problem.
|
|
|
|
@c comma is part of primary
|
|
@cindex artificial intelligence, @command{gawk} and
|
|
@item doc/awkforai.txt
|
|
A short article describing why @command{gawk} is a good language for
|
|
AI (Artificial Intelligence) programming.
|
|
|
|
@item doc/README.card
|
|
@itemx doc/ad.block
|
|
@itemx doc/awkcard.in
|
|
@itemx doc/cardfonts
|
|
@itemx doc/colors
|
|
@itemx doc/macros
|
|
@itemx doc/no.colors
|
|
@itemx doc/setter.outline
|
|
The @command{troff} source for a five-color @command{awk} reference card.
|
|
A modern version of @command{troff} such as GNU @command{troff} (@command{groff}) is
|
|
needed to produce the color version. See the file @file{README.card}
|
|
for instructions if you have an older @command{troff}.
|
|
|
|
@item doc/gawk.1
|
|
The @command{troff} source for a manual page describing @command{gawk}.
|
|
This is distributed for the convenience of Unix users.
|
|
|
|
@cindex Texinfo
|
|
@item doc/gawk.texi
|
|
The Texinfo source file for this @value{DOCUMENT}.
|
|
It should be processed with @TeX{} to produce a printed document, and
|
|
with @command{makeinfo} to produce an Info or HTML file.
|
|
|
|
@item doc/awk.info
|
|
The generated Info file for this @value{DOCUMENT}.
|
|
|
|
@item doc/gawkinet.texi
|
|
The Texinfo source file for
|
|
@ifinfo
|
|
@xref{Top}.
|
|
@end ifinfo
|
|
@ifnotinfo
|
|
@cite{TCP/IP Internetworking with @command{gawk}}.
|
|
@end ifnotinfo
|
|
It should be processed with @TeX{} to produce a printed document and
|
|
with @command{makeinfo} to produce an Info or HTML file.
|
|
|
|
@item doc/gawkinet.info
|
|
The generated Info file for
|
|
@cite{TCP/IP Internetworking with @command{gawk}}.
|
|
|
|
@item doc/igawk.1
|
|
The @command{troff} source for a manual page describing the @command{igawk}
|
|
program presented in
|
|
@ref{Igawk Program}.
|
|
|
|
@item doc/Makefile.in
|
|
The input file used during the configuration process to generate the
|
|
actual @file{Makefile} for creating the documentation.
|
|
|
|
@item Makefile.am
|
|
@itemx */Makefile.am
|
|
Files used by the GNU @command{automake} software for generating
|
|
the @file{Makefile.in} files used by @command{autoconf} and
|
|
@command{configure}.
|
|
|
|
@item Makefile.in
|
|
@itemx acconfig.h
|
|
@itemx acinclude.m4
|
|
@itemx aclocal.m4
|
|
@itemx configh.in
|
|
@itemx configure.in
|
|
@itemx configure
|
|
@itemx custom.h
|
|
@itemx missing_d/*
|
|
@itemx m4/*
|
|
These files and subdirectories are used when configuring @command{gawk}
|
|
for various Unix systems. They are explained in
|
|
@ref{Unix Installation}.
|
|
|
|
@item intl/*
|
|
@itemx po/*
|
|
The @file{intl} directory provides the GNU @code{gettext} library, which implements
|
|
@command{gawk}'s internationalization features, while the @file{po} directory
|
|
contains message translations.
|
|
|
|
@item awklib/extract.awk
|
|
@itemx awklib/Makefile.am
|
|
@itemx awklib/Makefile.in
|
|
@itemx awklib/eg/*
|
|
The @file{awklib} directory contains a copy of @file{extract.awk}
|
|
(@pxref{Extract Program}),
|
|
which can be used to extract the sample programs from the Texinfo
|
|
source file for this @value{DOCUMENT}. It also contains a @file{Makefile.in} file, which
|
|
@command{configure} uses to generate a @file{Makefile}.
|
|
@file{Makefile.am} is used by GNU Automake to create @file{Makefile.in}.
|
|
The library functions from
|
|
@ref{Library Functions},
|
|
and the @command{igawk} program from
|
|
@ref{Igawk Program},
|
|
are included as ready-to-use files in the @command{gawk} distribution.
|
|
They are installed as part of the installation process.
|
|
The rest of the programs in this @value{DOCUMENT} are available in appropriate
|
|
subdirectories of @file{awklib/eg}.
|
|
|
|
@item unsupported/atari/*
|
|
Files needed for building @command{gawk} on an Atari ST
|
|
(@pxref{Atari Installation}, for details).
|
|
|
|
@item unsupported/tandem/*
|
|
Files needed for building @command{gawk} on a Tandem
|
|
(@pxref{Tandem Installation}, for details).
|
|
|
|
@item posix/*
|
|
Files needed for building @command{gawk} on POSIX-compliant systems.
|
|
|
|
@item pc/*
|
|
Files needed for building @command{gawk} under MS-DOS, MS Windows and OS/2
|
|
(@pxref{PC Installation}, for details).
|
|
|
|
@item vms/*
|
|
Files needed for building @command{gawk} under VMS
|
|
(@pxref{VMS Installation}, for details).
|
|
|
|
@item test/*
|
|
A test suite for
|
|
@command{gawk}. You can use @samp{make check} from the top-level @command{gawk}
|
|
directory to run your version of @command{gawk} against the test suite.
|
|
If @command{gawk} successfully passes @samp{make check}, then you can
|
|
be confident of a successful port.
|
|
@end table
|
|
@c ENDOFRANGE gawdis
|
|
|
|
@node Unix Installation
|
|
@appendixsec Compiling and Installing @command{gawk} on Unix
|
|
|
|
Usually, you can compile and install @command{gawk} by typing only two
|
|
commands. However, if you use an unusual system, you may need
|
|
to configure @command{gawk} for your system yourself.
|
|
|
|
@menu
|
|
* Quick Installation:: Compiling @command{gawk} under Unix.
|
|
* Additional Configuration Options:: Other compile-time options.
|
|
* Configuration Philosophy:: How it's all supposed to work.
|
|
@end menu
|
|
|
|
@node Quick Installation
|
|
@appendixsubsec Compiling @command{gawk} for Unix
|
|
|
|
@c @cindex installation, unix
|
|
After you have extracted the @command{gawk} distribution, @command{cd}
|
|
to @file{gawk-@value{VERSION}.@value{PATCHLEVEL}}. Like most GNU software,
|
|
@command{gawk} is configured
|
|
automatically for your Unix system by running the @command{configure} program.
|
|
This program is a Bourne shell script that is generated automatically using
|
|
GNU @command{autoconf}.
|
|
@ifnotinfo
|
|
(The @command{autoconf} software is
|
|
described fully in
|
|
@cite{Autoconf---Generating Automatic Configuration Scripts},
|
|
which is available from the Free Software Foundation.)
|
|
@end ifnotinfo
|
|
@ifinfo
|
|
(The @command{autoconf} software is described fully starting with
|
|
@ref{Top}.)
|
|
@end ifinfo
|
|
|
|
To configure @command{gawk}, simply run @command{configure}:
|
|
|
|
@example
|
|
sh ./configure
|
|
@end example
|
|
|
|
This produces a @file{Makefile} and @file{config.h} tailored to your system.
|
|
The @file{config.h} file describes various facts about your system.
|
|
You might want to edit the @file{Makefile} to
|
|
change the @code{CFLAGS} variable, which controls
|
|
the command-line options that are passed to the C compiler (such as
|
|
optimization levels or compiling for debugging).
|
|
|
|
Alternatively, you can add your own values for most @command{make}
|
|
variables on the command line, such as @code{CC} and @code{CFLAGS}, when
|
|
running @command{configure}:
|
|
|
|
@example
|
|
CC=cc CFLAGS=-g sh ./configure
|
|
@end example
|
|
|
|
@noindent
|
|
See the file @file{INSTALL} in the @command{gawk} distribution for
|
|
all the details.
|
|
|
|
After you have run @command{configure} and possibly edited the @file{Makefile},
|
|
type:
|
|
|
|
@example
|
|
make
|
|
@end example
|
|
|
|
@noindent
|
|
Shortly thereafter, you should have an executable version of @command{gawk}.
|
|
That's all there is to it!
|
|
To verify that @command{gawk} is working properly,
|
|
run @samp{make check}. All of the tests should succeed.
|
|
If these steps do not work, or if any of the tests fail,
|
|
check the files in the @file{README_d} directory to see if you've
|
|
found a known problem. If the failure is not described there,
|
|
please send in a bug report
|
|
(@pxref{Bugs}).
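
Putting the steps in this @value{SECTION} together, a typical session might
look like the following sketch. It assumes that the distribution has
already been unpacked and that the default settings chosen by
@command{configure} are acceptable on your system:

@example
cd gawk-@value{VERSION}.@value{PATCHLEVEL}
sh ./configure
make
make check
@end example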
|
|
|
|
@node Additional Configuration Options
|
|
@appendixsubsec Additional Configuration Options
|
|
@cindex @command{gawk}, configuring, options
|
|
@c comma is part of primary
|
|
@cindex configuration options, @command{gawk}
|
|
|
|
There are several additional options you may use on the @command{configure}
|
|
command line when compiling @command{gawk} from scratch, including:
|
|
|
|
@table @code
|
|
@cindex @code{--enable-portals} configuration option
|
|
@cindex configuration option, @code{--enable-portals}
|
|
@item --enable-portals
|
|
Treat pathnames that begin
|
|
with @file{/p} as BSD portal files when doing two-way I/O with
|
|
the @samp{|&} operator
|
|
(@pxref{Portal Files}).
|
|
|
|
@cindex @code{--enable-switch} configuration option
|
|
@cindex configuration option, @code{--enable-switch}
|
|
@item --enable-switch
|
|
Enable the recognition and execution of C-style @code{switch} statements
|
|
in @command{awk} programs
|
|
(@pxref{Switch Statement}).
|
|
|
|
@cindex Linux
|
|
@cindex GNU/Linux
|
|
@cindex @code{--with-included-gettext} configuration option
|
|
@cindex @code{--with-included-gettext} configuration option, configuring @command{gawk} with
|
|
@cindex configuration option, @code{--with-included-gettext}
|
|
@item --with-included-gettext
|
|
Use the version of the @code{gettext} library that comes with @command{gawk}.
|
|
This option should be used on systems that do @emph{not} use @value{PVERSION} 2 (or later)
|
|
of the GNU C library.
|
|
All known modern GNU/Linux systems use Glibc 2. Use this option on any other system.
|
|
|
|
@cindex @code{--disable-lint} configuration option
|
|
@cindex configuration option, @code{--disable-lint}
|
|
@item --disable-lint
|
|
This option disables all lint checking within @command{gawk}. The
|
|
@option{--lint} and @option{--lint-old} options
|
|
(@pxref{Options})
|
|
are accepted, but silently do nothing.
|
|
Similarly, setting the @code{LINT} variable
|
|
(@pxref{User-modified})
|
|
has no effect on the running @command{awk} program.
|
|
|
|
When used with GCC's automatic dead-code-elimination, this option
|
|
cuts almost 200K bytes off the size of the @command{gawk}
|
|
executable on GNU/Linux x86 systems. Results on other systems and
|
|
with other compilers are likely to vary.
|
|
Using this option may bring you some slight performance improvement.
|
|
|
|
Using this option will cause some of the tests in the test suite
|
|
to fail. This option may be removed at a later date.
|
|
|
|
@cindex @code{--disable-nls} configuration option
|
|
@cindex configuration option, @code{--disable-nls}
|
|
@item --disable-nls
|
|
Disable all message-translation facilities.
|
|
This is usually not desirable, but it may bring you some slight performance
|
|
improvement.
|
|
You should also use this option if @option{--with-included-gettext}
|
|
doesn't work on your system.
|
|
@end table
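
As an illustration, several of these options can be combined in a single
@command{configure} invocation. The following sketch (the particular
combination is only an example, not a recommendation) enables
@code{switch} statements and turns off message translation:

@example
sh ./configure --enable-switch --disable-nls
@end example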
|
|
|
|
@node Configuration Philosophy
|
|
@appendixsubsec The Configuration Process
|
|
|
|
@cindex @command{gawk}, configuring
|
|
This @value{SECTION} is of interest only if you know something about using the
|
|
C language and the Unix operating system.
|
|
|
|
The source code for @command{gawk} generally attempts to adhere to formal
|
|
standards wherever possible. This means that @command{gawk} uses library
|
|
routines that are specified by the ISO C standard and by the POSIX
|
|
operating system interface standard. When using an ISO C compiler,
|
|
function prototypes are used to help improve the compile-time checking.
|
|
|
|
Many Unix systems do not support all of either the ISO or the
|
|
POSIX standards. The @file{missing_d} subdirectory in the @command{gawk}
|
|
distribution contains replacement versions of those functions that are
|
|
most likely to be missing.
|
|
|
|
The @file{config.h} file that @command{configure} creates contains
|
|
definitions that describe features of the particular operating system
|
|
where you are attempting to compile @command{gawk}. This file describes
three things: which header files are available, so that they can be
correctly included; which (supposedly) standard functions are actually
available in your C libraries; and various miscellaneous
facts about your variant of Unix. For example, there may not be an
|
|
@code{st_blksize} element in the @code{stat} structure. In this case,
|
|
@samp{HAVE_ST_BLKSIZE} is undefined.
|
|
|
|
@cindex @code{custom.h} file
|
|
It is possible for your C compiler to lie to @command{configure}. It may
|
|
do so by not exiting with an error when a library function is not
|
|
available. To get around this, edit the file @file{custom.h}.
|
|
Use an @samp{#ifdef} that is appropriate for your system, and either
|
|
@code{#define} any constants that @command{configure} should have defined but
|
|
didn't, or @code{#undef} any constants that @command{configure} defined and
|
|
should not have. @file{custom.h} is automatically included by
|
|
@file{config.h}.
|
|
|
|
It is also possible that the @command{configure} program generated by
|
|
@command{autoconf} will not work on your system in some other fashion.
|
|
If you do have a problem, the file @file{configure.in} is the input for
|
|
@command{autoconf}. You may be able to change this file and generate a
|
|
new version of @command{configure} that works on your system
|
|
(@pxref{Bugs},
|
|
for information on how to report problems in configuring @command{gawk}).
|
|
The same mechanism may be used to send in updates to @file{configure.in}
|
|
and/or @file{custom.h}.
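
As a rough sketch, regenerating and rerunning the script after editing
@file{configure.in} might look like the following. This assumes that a
version of @command{autoconf} compatible with the one used to build the
distribution is installed; depending on what you changed, other
files (such as @file{aclocal.m4}) may also need to be regenerated:

@example
autoconf
sh ./configure
@end example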
|
|
|
|
@node Non-Unix Installation
|
|
@appendixsec Installation on Other Operating Systems
|
|
|
|
This @value{SECTION} describes how to install @command{gawk} on
|
|
various non-Unix systems.
|
|
|
|
@menu
|
|
* Amiga Installation:: Installing @command{gawk} on an Amiga.
|
|
* BeOS Installation:: Installing @command{gawk} on BeOS.
|
|
* PC Installation:: Installing and Compiling @command{gawk} on
|
|
MS-DOS and OS/2.
|
|
* VMS Installation:: Installing @command{gawk} on VMS.
|
|
@end menu
|
|
|
|
@node Amiga Installation
|
|
@appendixsubsec Installing @command{gawk} on an Amiga
|
|
|
|
@cindex amiga
|
|
@cindex installation, amiga
|
|
You can install @command{gawk} on an Amiga system using a Unix emulation
|
|
environment, available via anonymous @command{ftp} from
|
|
@code{ftp.ninemoons.com} in the directory @file{pub/ade/current}.
|
|
This includes a shell based on @command{pdksh}. The primary component of
|
|
this environment is a Unix emulation library, @file{ixemul.lib}.
|
|
@c could really use more background here, who wrote this, etc.
|
|
|
|
A more complete distribution for the Amiga is available on
|
|
the Geek Gadgets CD-ROM, available from:
|
|
|
|
@display
|
|
CRONUS
|
|
1840 E. Warner Road #105-265
|
|
Tempe, AZ 85284 USA
|
|
US Toll Free: (800) 804-0833
|
|
Phone: +1-602-491-0442
|
|
FAX: +1-602-491-0048
|
|
Email: @email{info@@ninemoons.com}
|
|
WWW: @uref{http://www.ninemoons.com}
|
|
Anonymous @command{ftp} site: @code{ftp.ninemoons.com}
|
|
@end display
|
|
|
|
Once you have the distribution, you can configure @command{gawk} simply by
|
|
running @command{configure}:
|
|
|
|
@example
|
|
configure -v m68k-amigaos
|
|
@end example
|
|
|
|
Then run @command{make} and you should be all set!
|
|
If these steps do not work, please send in a bug report
|
|
(@pxref{Bugs}).
|
|
|
|
@node BeOS Installation
|
|
@appendixsubsec Installing @command{gawk} on BeOS
|
|
@cindex BeOS
|
|
@cindex installation, beos
|
|
|
|
@c From email contributed by Martin Brown, mc@whoever.com
|
|
Since BeOS DR9, all the tools that you should need to build @command{gawk} are
|
|
included with BeOS. The process is basically identical to the Unix process
|
|
of running @command{configure} and then @command{make}. Full instructions are given below.
|
|
|
|
You can compile @command{gawk} under BeOS by extracting the standard sources
|
|
and running @command{configure}. You @emph{must} specify the location
|
|
prefix for the installation directory. For BeOS DR9 and beyond, the best directory to
|
|
use is @file{/boot/home/config}, so the @command{configure} command is:
|
|
|
|
@example
|
|
configure --prefix=/boot/home/config
|
|
@end example
|
|
|
|
This installs the compiled application into @file{/boot/home/config/bin},
|
|
which is already specified in the standard @env{PATH}.
|
|
|
|
Once the configuration process is completed, you can run @command{make},
|
|
and then @samp{make install}:
|
|
|
|
@example
|
|
$ make
|
|
@dots{}
|
|
$ make install
|
|
@end example
|
|
|
|
BeOS uses @command{bash} as its shell; thus, you use @command{gawk} the same way you would
|
|
under Unix.
|
|
If these steps do not work, please send in a bug report
|
|
(@pxref{Bugs}).
|
|
|
|
@c Rewritten by Scott Deifik <scottd@amgen.com>
|
|
@c and Darrel Hankerson <hankedr@mail.auburn.edu>
|
|
|
|
@node PC Installation
|
|
@appendixsubsec Installation on PC Operating Systems
|
|
|
|
@c first comma is part of primary
|
|
@cindex PC operating systems, @command{gawk} on, installing
|
|
@c {PC, gawk on} is the secondary term
|
|
@cindex operating systems, PC, @command{gawk} on, installing
|
|
This @value{SECTION} covers installation and usage of @command{gawk} on x86 machines
|
|
running DOS, any version of Windows, or OS/2.
|
|
In this @value{SECTION}, the term ``Windows32''
|
|
refers to any of Windows-95/98/ME/NT/2000.
|
|
|
|
The limitations of DOS (and DOS shells under Windows or OS/2) have meant
|
|
that various ``DOS extenders'' are often used with programs such as
|
|
@command{gawk}. The varying capabilities of Microsoft Windows 3.1
|
|
and Windows32 can add to the confusion. For an overview of the
|
|
considerations, please refer to @file{README_d/README.pc} in the
|
|
distribution.
|
|
|
|
@menu
|
|
* PC Binary Installation:: Installing a prepared distribution.
|
|
* PC Compiling:: Compiling @command{gawk} for MS-DOS, Windows32,
|
|
and OS/2.
|
|
* PC Dynamic:: Compiling @command{gawk} for dynamic libraries.
|
|
* PC Using:: Running @command{gawk} on MS-DOS, Windows32 and
|
|
OS/2.
|
|
* Cygwin:: Building and running @command{gawk} for
|
|
Cygwin.
|
|
@end menu
|
|
|
|
@node PC Binary Installation
|
|
@appendixsubsubsec Installing a Prepared Distribution for PC Systems
|
|
|
|
If you have received a binary distribution prepared by the DOS
|
|
maintainers, then @command{gawk} and the necessary support files appear
|
|
under the @file{gnu} directory, with executables in @file{gnu/bin},
|
|
libraries in @file{gnu/lib/awk}, and manual pages under @file{gnu/man}.
|
|
This is designed for easy installation to a @file{/gnu} directory on your
|
|
drive---however, the files can be installed anywhere provided @env{AWKPATH} is
|
|
set properly. Regardless of the installation directory, the first line of
|
|
@file{igawk.cmd} and @file{igawk.bat} (in @file{gnu/bin}) may need to be
|
|
edited.
|
|
|
|
The binary distribution contains a separate file describing the
|
|
contents. In particular, it may include more than one version of the
|
|
@command{gawk} executable.
|
|
|
|
OS/2 (32 bit, EMX) binary distributions are prepared for the @file{/usr}
|
|
directory of your preferred drive. Set @env{UNIXROOT} to your installation
|
|
drive (e.g., @samp{e:}) if you want to install @command{gawk} onto a drive
other than the hardcoded default @samp{c:}. Executables appear in @file{/usr/bin},
|
|
libraries under @file{/usr/share/awk}, manual pages under @file{/usr/man},
|
|
Texinfo documentation under @file{/usr/info} and NLS files under @file{/usr/share/locale}.
|
|
If you already have a file @file{/usr/info/dir} from another package,
@emph{do not overwrite it!} Instead, enter the following commands at your prompt
|
|
(replace @samp{x:} by your installation drive):
|
|
|
|
@example
|
|
install-info --info-dir=x:/usr/info x:/usr/info/awk.info
|
|
install-info --info-dir=x:/usr/info x:/usr/info/gawkinet.info
|
|
@end example
|
|
|
|
However, the files can be installed anywhere provided @env{AWKPATH} is
|
|
set properly.
|
|
|
|
The binary distribution may contain a separate file containing additional
|
|
or more detailed installation instructions.
|
|
|
|
@node PC Compiling
|
|
@appendixsubsubsec Compiling @command{gawk} for PC Operating Systems
|
|
|
|
@command{gawk} can be compiled for MS-DOS, Windows32, and OS/2 using the GNU
|
|
development tools from DJ Delorie (DJGPP; MS-DOS only) or Eberhard
|
|
Mattes (EMX; MS-DOS, Windows32 and OS/2). Microsoft Visual C/C++ can be used
|
|
to build a Windows32 version, and Microsoft C/C++ can be
|
|
used to build 16-bit versions for MS-DOS and OS/2.
|
|
@c FIXME:
|
|
(As of @command{gawk} 3.1.2, the MSC version doesn't work. However,
|
|
the maintainer is working on fixing it.)
|
|
The file
|
|
@file{README_d/README.pc} in the @command{gawk} distribution contains
|
|
additional notes, and @file{pc/Makefile} contains important information on
|
|
compilation options.
|
|
|
|
To build @command{gawk} for MS-DOS, Windows32, and 16-bit OS/2, copy the
files in the @file{pc} directory (@emph{except} for @file{ChangeLog}) to the
directory with the rest of the @command{gawk} sources. (For 32-bit OS/2
with EMX, you can use the @command{configure} script instead and skip the
following paragraphs; details are given below.)
The @file{Makefile} contains a configuration section with comments and
may need to be edited in order to work with your @command{make} utility.
|
|
|
|
The @file{Makefile} contains a number of targets for building various MS-DOS,
|
|
Windows32, and OS/2 versions. A list of targets is printed if the @command{make}
|
|
command is given without a target. As an example, to build @command{gawk}
|
|
using the DJGPP tools, enter @samp{make djgpp}.
|
|
|
|
Using @command{make} to run the standard tests and to install @command{gawk}
|
|
requires additional Unix-like tools, including @command{sh}, @command{sed}, and
|
|
@command{cp}. In order to run the tests, the @file{test/*.ok} files may need to
|
|
be converted so that they have the usual DOS-style end-of-line markers. Most
|
|
of the tests work properly with Stewartson's shell along with the
|
|
companion utilities or appropriate GNU utilities. However, some editing of
|
|
@file{test/Makefile} is required. It is recommended that you copy the file
|
|
@file{pc/Makefile.tst} over the file @file{test/Makefile} as a
|
|
replacement. Details can be found in @file{README_d/README.pc}
|
|
and in the file @file{pc/Makefile.tst}.
|
|
|
|
The 32 bit EMX version of @command{gawk} works ``out of the box'' under OS/2.
|
|
In principle, it is possible to compile @command{gawk} the following way:
|
|
|
|
@example
|
|
$ ./configure
|
|
$ make
|
|
@end example
|
|
|
|
This is not recommended, though. To get an OMF executable you should
|
|
use the following commands at your @command{sh} prompt:
|
|
|
|
@example
|
|
$ CPPFLAGS="-D__ST_MT_ERRNO__"
|
|
$ export CPPFLAGS
|
|
$ CFLAGS="-O2 -Zomf -Zmt"
|
|
$ export CFLAGS
|
|
$ LDFLAGS="-s -Zcrtdll -Zlinker /exepack:2 -Zlinker /pm:vio -Zstack 0x8000"
|
|
$ export LDFLAGS
|
|
$ RANLIB="echo"
|
|
$ export RANLIB
|
|
$ ./configure --prefix=c:/usr --without-included-gettext
|
|
$ make AR=emxomfar
|
|
@end example
|
|
|
|
These are just suggestions. You may use any other set of (self-consistent)
|
|
environment variables and compiler flags.
|
|
|
|
To get an FHS-compliant file hierarchy it is recommended to use the additional
|
|
@command{configure} options @option{--infodir=c:/usr/share/info}, @option{--mandir=c:/usr/share/man}
|
|
and @option{--libexecdir=c:/usr/lib}.
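
For example, combining these options with the @command{configure} command
shown earlier gives the following (a sketch that assumes the same
@file{c:/usr} prefix; adjust the drive letter to suit your installation):

@example
$ ./configure --prefix=c:/usr --without-included-gettext \
    --infodir=c:/usr/share/info --mandir=c:/usr/share/man \
    --libexecdir=c:/usr/lib
@end example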
|
|
|
|
The internal @code{gettext} library tends to be problematic. It is therefore recommended
|
|
to use either an external one (@option{--without-included-gettext}) or to disable
|
|
NLS entirely (@option{--disable-nls}).
|
|
|
|
If you use GCC 2.95 or newer, it is recommended that you also set:
|
|
|
|
@example
|
|
$ LIBS="-lgcc"
|
|
$ export LIBS
|
|
@end example
|
|
|
|
You can also get an @code{a.out} executable if you prefer:
|
|
|
|
@example
|
|
$ CPPFLAGS="-D__ST_MT_ERRNO__"
|
|
$ export CPPFLAGS
|
|
$ CFLAGS="-O2 -Zmt"
|
|
$ export CFLAGS
|
|
$ LDFLAGS="-s -Zstack 0x8000"
$ export LDFLAGS
$ LIBS="-lgcc"
$ export LIBS
|
|
$ unset RANLIB
|
|
$ ./configure --prefix=c:/usr --without-included-gettext
|
|
$ make
|
|
@end example
|
|
|
|
@strong{Note:} Even if the compiled @command{gawk.exe} (@code{a.out}) executable
|
|
contains a DOS header, it does @emph{not} work under DOS. To compile an executable
|
|
that runs under DOS, @code{"-DPIPES_SIMULATED"} must be added to @env{CPPFLAGS}.
|
|
But then some nonstandard extensions of @command{gawk} (e.g., @samp{|&}) do not work!
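
For instance, starting from the settings used for the @code{a.out} build,
the preprocessor flags for a DOS-capable executable could be set as follows
(a sketch; the rest of the build is assumed to be unchanged):

@example
$ CPPFLAGS="-D__ST_MT_ERRNO__ -DPIPES_SIMULATED"
$ export CPPFLAGS
@end example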
|
|
|
|
After compilation the internal tests can be performed. Enter
|
|
@samp{make check CMP="diff -a"} at your command prompt. All tests
|
|
but the @code{pid} test are expected to work properly. The @code{pid}
|
|
test fails because child processes are not started by @code{fork()}.
|
|
|
|
@samp{make install} works as expected.
|
|
|
|
@strong{Note:} Most OS/2 ports of GNU @command{make} are not able to handle
|
|
the Makefiles of this package. If you encounter any problems with @command{make},
try GNU Make 3.79.1 or a later version. You should find the latest
version at @uref{http://www.unixos2.org/sw/pub/binary/make/} or at
|
|
@uref{ftp://hobbes.nmsu.edu/pub/os2/}.
|
|
|
|
@node PC Dynamic
|
|
@appendixsubsubsec Compiling @command{gawk} For Dynamic Libraries
|
|
|
|
@c From README_d/README.pcdynamic
|
|
@c 11 June 2003
|
|
|
|
To compile @command{gawk} with dynamic extension support,
|
|
uncomment the definitions of @code{DYN_FLAGS}, @code{DYN_EXP},
|
|
@code{DYN_OBJ}, and @code{DYN_MAKEXP} in the configuration section of
|
|
the @file{Makefile}. There are two definitions for @code{DYN_MAKEXP}:
|
|
pick the one that matches your target.
|
|
|
|
To build some of the example extension libraries, @command{cd} to the
|
|
extension directory and copy @file{Makefile.pc} to @file{Makefile}. You
|
|
can then build using the same two targets. To run the example
|
|
@command{awk} scripts, you'll need to either change the call to
|
|
the @code{extension} function to match the name of the library (for
|
|
instance, change @code{"./ordchr.so"} to @code{"ordchr.dll"} or simply
|
|
@code{"ordchr"}), or rename the library to match the call (for instance,
|
|
rename @file{ordchr.dll} to @file{ordchr.so}).
|
|
|
|
If you build @command{gawk.exe} with one compiler (Visual C or MinGW) but want to build
|
|
an extension library with the other, you need to copy the import
|
|
library. Visual C uses a library called @file{gawk.lib}, while MinGW uses
|
|
a library called @file{libgawk.a}. These files are equivalent and will
|
|
interoperate if you give them the correct name. The resulting shared
|
|
libraries are also interoperable.
|
|
|
|
To create your own extension library, you can use the examples as models,
|
|
but you're essentially on your own. Post to @code{comp.lang.awk} or
|
|
send electronic mail to @email{ptjm@@interlog.com} if you have problems getting
|
|
started. If you need to access functions or variables which are not
|
|
exported by @command{gawk.exe}, add them to @file{gawkw32.def} and
|
|
rebuild. You should also add @code{ATTRIBUTE_EXPORTED} to the declaration
|
|
in @file{awk.h} of any variables you add to @file{gawkw32.def}.
|
|
|
|
Note that extension libraries have the name of the @command{awk}
|
|
executable embedded in them at link time, so they will work only
|
|
with @command{gawk.exe}. In particular, they won't work if you
|
|
rename @command{gawk.exe} to @command{awk.exe} or if you try to use
|
|
@command{pgawk.exe}. To work around this for profiling, temporarily rename
@command{pgawk.exe} to @command{gawk.exe}. Alternatively, change the
program name in the definition of @code{DYN_MAKEXP} for your compiler
so that it matches the executable you intend to use.
|
|
|
|
On Windows32, libraries are sought first in the current directory, then in
|
|
the directory containing @command{gawk.exe}, and finally through the
|
|
@env{PATH} environment variable.
|
|
|
|
@node PC Using
|
|
@appendixsubsubsec Using @command{gawk} on PC Operating Systems
|
|
@c STARTOFRANGE opgawx
|
|
@cindex operating systems, PC, @command{gawk} on
|
|
@c STARTOFRANGE pcgawon
|
|
@cindex PC operating systems, @command{gawk} on
|
|
|
|
With the exception of the Cygwin environment,
|
|
the @samp{|&} operator and TCP/IP networking
|
|
(@pxref{TCP/IP Networking})
|
|
are not supported for MS-DOS or MS-Windows. EMX (OS/2 only) does support
|
|
at least the @samp{|&} operator.
|
|
|
|
@cindex search paths
|
|
@cindex @command{gawk}, OS/2 version of
|
|
@cindex @command{gawk}, MS-DOS version of
|
|
@cindex @code{;} (semicolon), @code{AWKPATH} variable and
|
|
@cindex semicolon (@code{;}), @code{AWKPATH} variable and
|
|
@cindex @code{AWKPATH} environment variable
|
|
The OS/2 and MS-DOS versions of @command{gawk} search for program files as
|
|
described in @ref{AWKPATH Variable}.
|
|
However, semicolons (rather than colons) separate elements
|
|
in the @env{AWKPATH} variable. If @env{AWKPATH} is not set or is empty,
|
|
then the default search path for OS/2 (16 bit) and MS-DOS versions is
|
|
@code{@w{".;c:/lib/awk;c:/gnu/lib/awk"}}.
|
|
|
|
The search path for OS/2 (32 bit, EMX) is determined by the prefix directory
(most likely @file{/usr} or @file{c:/usr}) that was specified as an option to
the @command{configure} script, as is the case for the Unix versions.
|
|
If @file{c:/usr} is the prefix directory then the default search path contains @file{.}
|
|
and @file{c:/usr/share/awk}.
|
|
Additionally, to support binary distributions of @command{gawk} for OS/2
|
|
systems whose drive @samp{c:} might not support long file names or might not exist
|
|
at all, there is a special environment variable. If @env{UNIXROOT} specifies
|
|
a drive then this specific drive is also searched for program files.
|
|
For example, if @env{UNIXROOT} is set to @samp{e:}, the complete default search path is
|
|
@code{@w{".;c:/usr/share/awk;e:/usr/share/awk"}}.
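
To override the default, set @env{AWKPATH} yourself. The following sketch
assumes an @command{sh}-like shell and a hypothetical directory
@file{d:/myawk} holding your own program files; note the semicolons
and the quoting, which keeps the shell from treating them as command
separators:

@example
$ AWKPATH=".;d:/myawk;c:/usr/share/awk"
$ export AWKPATH
@end example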
|
|
|
|
An @command{sh}-like shell (as opposed to @command{command.com} under MS-DOS
|
|
or @command{cmd.exe} under OS/2) may be useful for @command{awk} programming.
|
|
Ian Stewartson has written an excellent shell for MS-DOS and OS/2,
|
|
Daisuke Aoyama has ported GNU @command{bash} to MS-DOS using the DJGPP tools,
|
|
and several shells are available for OS/2, including @command{ksh}. The file
|
|
@file{README_d/README.pc} in the @command{gawk} distribution contains
|
|
information on these shells. Users of Stewartson's shell on DOS should
|
|
examine its documentation for handling command lines; in particular,
|
|
the setting for @command{gawk} in the shell configuration may need to be
|
|
changed and the @code{ignoretype} option may also be of interest.
|
|
|
|
@cindex differences in @command{awk} and @command{gawk}, @code{BINMODE} variable
|
|
@cindex @code{BINMODE} variable
|
|
Under OS/2 and DOS, @command{gawk} (and many other text programs) silently
translates end-of-line @code{"\r\n"} to @code{"\n"} on input and @code{"\n"}
|
|
to @code{"\r\n"} on output. A special @code{BINMODE} variable allows
|
|
control over these translations and is interpreted as follows:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
If @code{BINMODE} is @code{"r"}, or
|
|
@code{(BINMODE & 1)} is nonzero, then
|
|
binary mode is set on read (i.e., no translations on reads).
|
|
|
|
@item
|
|
If @code{BINMODE} is @code{"w"}, or
|
|
@code{(BINMODE & 2)} is nonzero, then
|
|
binary mode is set on write (i.e., no translations on writes).
|
|
|
|
@item
|
|
If @code{BINMODE} is @code{"rw"} or @code{"wr"},
|
|
binary mode is set for both read and write
|
|
(same as @samp{BINMODE=3}).
|
|
|
|
@item
|
|
@code{BINMODE=@var{non-null-string}} is
|
|
the same as @samp{BINMODE=3} (i.e., no translations on
|
|
reads or writes). However, @command{gawk} issues a warning
|
|
message if the string is not one of @code{"rw"} or @code{"wr"}.
|
|
@end itemize
|
|
|
|
@noindent
|
|
The modes for standard input and standard output are set one time
|
|
only (after the
|
|
command line is read, but before processing any of the @command{awk} program).
|
|
Setting @code{BINMODE} for standard input or
|
|
standard output is accomplished by using an
|
|
appropriate @samp{-v BINMODE=@var{N}} option on the command line.
|
|
For other files and pipes, the mode is taken from @code{BINMODE} at the
time the file or pipe is opened and cannot be changed mid-stream.
|
|
|
|
The name @code{BINMODE} was chosen to match @command{mawk}
(@pxref{Other Versions}).
Both @command{mawk} and @command{gawk} handle @code{BINMODE} similarly; however,
@command{mawk} adds a @samp{-W BINMODE=@var{N}} option and an environment
variable that can set @code{BINMODE}, @code{RS}, and @code{ORS}. The
files @file{binmode[1-3].awk} (under @file{gnu/lib/awk} in some of the
prepared distributions) have been chosen to match @command{mawk}'s @samp{-W
BINMODE=@var{N}} option. These can be changed or discarded; in particular,
the setting of @code{RS} giving the fewest ``surprises'' is open to debate.
@command{mawk} uses @samp{RS = "\r\n"} if binary mode is set on read, which is
appropriate for files with the DOS-style end-of-line.

To illustrate, the following examples set binary mode on writes for standard
output and other files, and set @code{ORS} as the ``usual'' DOS-style
end-of-line:

@example
gawk -v BINMODE=2 -v ORS="\r\n" @dots{}
@end example

@noindent
or:

@example
gawk -v BINMODE=w -f binmode2.awk @dots{}
@end example

@noindent
These give the same result as the @samp{-W BINMODE=2} option in
@command{mawk}.
The following changes the record separator to @code{"\r\n"} and sets binary
mode on reads, but does not affect the mode on standard input:

@example
gawk -v RS="\r\n" --source "BEGIN @{ BINMODE = 1 @}" @dots{}
@end example

@noindent
or:

@example
gawk -f binmode1.awk @dots{}
@end example

@noindent
With proper quoting, in the first example the setting of @code{RS} can be
moved into the @code{BEGIN} rule.

@node Cygwin
@appendixsubsubsec Using @command{gawk} In The Cygwin Environment

@command{gawk} can be used ``out of the box'' under Windows if you are
using the Cygwin environment.@footnote{@uref{http://www.cygwin.com}}
This environment provides an excellent simulation of Unix, using the
GNU tools, such as @command{bash}, the GNU Compiler Collection (GCC),
GNU Make, and other GNU tools. Compilation and installation for Cygwin
are the same as for a Unix system:

@example
tar -xvpzf gawk-@value{VERSION}.@value{PATCHLEVEL}.tar.gz
cd gawk-@value{VERSION}.@value{PATCHLEVEL}
./configure
make
@end example

When compared to GNU/Linux on the same system, the @samp{configure}
step on Cygwin takes considerably longer. However, it does finish,
and then the @samp{make} proceeds as usual.

@strong{Note:} The @samp{|&} operator and TCP/IP networking
(@pxref{TCP/IP Networking})
are fully supported in the Cygwin environment. This is not true
for any other environment for MS-DOS or MS-Windows.

@node VMS Installation
@appendixsubsec How to Compile and Install @command{gawk} on VMS

@c based on material from Pat Rankin <rankin@eql.caltech.edu>
@c now rankin@pactechdata.com

@cindex installation, vms
This @value{SUBSECTION} describes how to compile and install @command{gawk} under VMS.

@menu
* VMS Compilation:: How to compile @command{gawk} under VMS.
* VMS Installation Details:: How to install @command{gawk} under VMS.
* VMS Running:: How to run @command{gawk} under VMS.
* VMS POSIX:: Alternate instructions for VMS POSIX.
@end menu

@node VMS Compilation
@appendixsubsubsec Compiling @command{gawk} on VMS

To compile @command{gawk} under VMS, there is a @code{DCL} command procedure that
issues all the necessary @code{CC} and @code{LINK} commands. There is
also a @file{Makefile} for use with the @code{MMS} utility. From the source
directory, use either:

@example
$ @@[.VMS]VMSBUILD.COM
@end example

@noindent
or:

@example
$ MMS/DESCRIPTION=[.VMS]DESCRIP.MMS GAWK
@end example

Depending upon which C compiler you are using, follow one of the sets
of instructions in this table:

@table @asis
@item VAX C V3.x
Use either @file{vmsbuild.com} or @file{descrip.mms} as is. These use
@code{CC/OPTIMIZE=NOLINE}, which is essential for Version 3.0.

@item VAX C V2.x
You must have Version 2.3 or 2.4; older ones won't work. Edit either
@file{vmsbuild.com} or @file{descrip.mms} according to the comments in them.
For @file{vmsbuild.com}, this just entails removing two @samp{!} delimiters.
Also edit @file{config.h} (which is a copy of file @file{[.config]vms-conf.h})
and comment out or delete the two lines @samp{#define __STDC__ 0} and
@samp{#define VAXC_BUILTINS} near the end.

@item GNU C
Edit @file{vmsbuild.com} or @file{descrip.mms}; the changes are different
from those for VAX C V2.x but equally straightforward. No changes to
@file{config.h} are needed.

@item DEC C
Edit @file{vmsbuild.com} or @file{descrip.mms} according to their comments.
No changes to @file{config.h} are needed.
@end table

@command{gawk} has been tested under VAX/VMS 5.5-1 using VAX C V3.2, and
GNU C 1.40 and 2.3. It should work without modifications for VMS V4.6 and up.

@node VMS Installation Details
@appendixsubsubsec Installing @command{gawk} on VMS

To install @command{gawk}, all you need is a ``foreign'' command, which is
a @code{DCL} symbol whose value begins with a dollar sign. For example:

@example
$ GAWK :== $disk1:[gnubin]GAWK
@end example

@noindent
Substitute the actual location of @command{gawk.exe} for
@samp{$disk1:[gnubin]}. The symbol should be placed in the
@file{login.com} of any user who wants to run @command{gawk},
so that it is defined every time the user logs on.
Alternatively, the symbol may be placed in the system-wide
@file{sylogin.com} procedure, which allows all users
to run @command{gawk}.

Optionally, the help entry can be loaded into a VMS help library:

@example
$ LIBRARY/HELP SYS$HELP:HELPLIB [.VMS]GAWK.HLP
@end example

@noindent
(You may want to substitute a site-specific help library rather than
the standard VMS library @samp{HELPLIB}.) After loading the help text,
the command:

@example
$ HELP GAWK
@end example

@noindent
provides information about both the @command{gawk} implementation and the
@command{awk} programming language.

The logical name @samp{AWK_LIBRARY} can designate a default location
for @command{awk} program files. For the @option{-f} option, if the specified
@value{FN} has no device or directory path information in it, @command{gawk}
looks in the current directory first, then in the directory specified
by the translation of @samp{AWK_LIBRARY} if the file is not found.
If, after searching in both directories, the file still is not found,
@command{gawk} appends the suffix @samp{.awk} to the filename and retries
the file search. If @samp{AWK_LIBRARY} is not defined, that
portion of the file search fails benignly.

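
The lookup order just described can be sketched as a shell function.
This is an illustration only, not the VMS implementation:
@code{find_awk_source} is a hypothetical name, its second argument stands
in for the translation of @samp{AWK_LIBRARY}, and Unix-style paths are
assumed for readability.

```shell
# Hypothetical sketch of the lookup order described above; not gawk's code.
# $1 = file name given to -f (no directory part), $2 = library directory,
# which may be empty to model an undefined AWK_LIBRARY.
find_awk_source() {
    name=$1
    libdir=$2
    # First pass uses the name as given; second pass appends ".awk".
    for candidate in "$name" "$name.awk"; do
        for dir in . "$libdir"; do
            [ -n "$dir" ] || continue      # library undefined: skip quietly
            if [ -f "$dir/$candidate" ]; then
                echo "$dir/$candidate"
                return 0
            fi
        done
    done
    return 1
}
```

Calling @samp{find_awk_source report mylib} tries @file{./report},
@file{mylib/report}, @file{./report.awk}, and @file{mylib/report.awk}, in
that order, and prints the first one that exists.
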
@node VMS Running
@appendixsubsubsec Running @command{gawk} on VMS

Command-line parsing and quoting conventions are significantly different
on VMS, so examples in this @value{DOCUMENT} or from other sources often need minor
changes. They @emph{are} minor though, and all @command{awk} programs
should run correctly.

Here are a couple of trivial tests:

@example
$ gawk -- "BEGIN @{print ""Hello, World!""@}"
$ gawk -"W" version
! could also be -"W version" or "-W version"
@end example

@noindent
Note that uppercase and mixed-case text must be quoted.

The VMS port of @command{gawk} includes a @code{DCL}-style interface in addition
to the original shell-style interface (see the help entry for details).
One side effect of dual command-line parsing is that if there is only a
single parameter (as in the quoted string program above), the command
becomes ambiguous. To work around this, the normally optional @option{--}
flag is required to force Unix style rather than @code{DCL} parsing. If any
other dash-type options (or multiple parameters such as @value{DF}s to
process) are present, there is no ambiguity and @option{--} can be omitted.

@c @cindex directory search
@c @cindex path, search
@cindex search paths
@cindex search paths, for source files
The default search path, when looking for @command{awk} program files specified
by the @option{-f} option, is @code{"SYS$DISK:[],AWK_LIBRARY:"}. The logical
name @samp{AWKPATH} can be used to override this default. The format
of @samp{AWKPATH} is a comma-separated list of directory specifications.
When defining it, the value should be quoted so that it retains a single
translation and not a multitranslation @code{RMS} searchlist.

@node VMS POSIX
@appendixsubsubsec Building and Using @command{gawk} on VMS POSIX

Ignore the instructions above, although @file{vms/gawk.hlp} should still
be made available in a help library. The source tree should be unpacked
into a container file subsystem rather than into the ordinary VMS filesystem.
Make sure that the two scripts, @file{configure} and
@file{vms/posix-cc.sh}, are executable; use @samp{chmod +x} on them if
necessary. Then execute the following two commands:

@example
psx> CC=vms/posix-cc.sh configure
psx> make CC=c89 gawk
@end example

@noindent
The first command constructs files @file{config.h} and @file{Makefile} out
of templates, using a script to make the C compiler fit @command{configure}'s
expectations. The second command compiles and links @command{gawk} using
the C compiler directly; ignore any warnings from @command{make} about being
unable to redefine @code{CC}. @command{configure} takes a very long
time to execute, but at least it provides incremental feedback as it runs.

This has been tested with VAX/VMS V6.2, VMS POSIX V2.0, and DEC C V5.2.

Once built, @command{gawk} works like any other shell utility. Unlike
the normal VMS port of @command{gawk}, no special command-line manipulation is
needed in the VMS POSIX environment.

@node Unsupported
@appendixsec Unsupported Operating System Ports

This @value{SECTION} describes systems for which
the @command{gawk} port is no longer supported.

@menu
* Atari Installation:: Installing @command{gawk} on the Atari ST.
* Tandem Installation:: Installing @command{gawk} on a Tandem.
@end menu

@node Atari Installation
@appendixsubsec Installing @command{gawk} on the Atari ST

The Atari port is no longer supported. It is
included for those who might want to use it, but it is no longer being
actively maintained.

@c based on material from Michal Jaegermann <michal@gortel.phys.ualberta.ca>
@cindex atari
@cindex installation, atari
There are no substantial differences when installing @command{gawk} on
various Atari models. Compiled @command{gawk} executables do not require
a large amount of memory with most @command{awk} programs, and should run on all
Motorola processor-based models (referred to below as ST, even if that is not
exactly right).

In order to use @command{gawk}, you need to have a shell, either text or
graphics, that does not map all the characters of a command line to
uppercase. Maintaining case distinction in option flags is very
important (@pxref{Options}).
These days this is the default and it may only be a problem for some
very old machines. If your system does not preserve the case of option
flags, you need to upgrade your tools. Support for I/O
redirection is necessary to make it easy to import @command{awk} programs
from other environments. Pipes are nice to have but not vital.

@menu
* Atari Compiling:: Compiling @command{gawk} on Atari.
* Atari Using:: Running @command{gawk} on Atari.
@end menu

@node Atari Compiling
@appendixsubsubsec Compiling @command{gawk} on the Atari ST

A proper compilation of @command{gawk} sources when @code{sizeof(int)}
differs from @code{sizeof(void *)} requires an ISO C compiler. An initial
port was done with @command{gcc}. You may actually prefer executables
where @code{int}s are four bytes wide, but the other variant works as well.

You may need quite a bit of memory when trying to recompile the @command{gawk}
sources, as some source files (@file{regex.c} in particular) are quite
big. If you run out of memory compiling such a file, try reducing the
optimization level for this particular file, which may help.

@cindex Linux
@cindex GNU/Linux
With a reasonable shell (@command{bash} will do), you have a pretty good chance
that the @command{configure} utility will succeed, particularly if
you run GNU/Linux, MiNT or a similar operating system. Otherwise,
sample versions of @file{config.h} and @file{Makefile.st} are given in the
@file{atari} subdirectory and can be edited and copied to the
corresponding files in the main source directory. Even if
@command{configure} produces something, it might be advisable to compare
its results with the sample versions and possibly make adjustments.

Some @command{gawk} source code fragments depend on a preprocessor define
@samp{atarist}. This basically assumes the TOS environment with @command{gcc}.
Modify these sections as appropriate if they are not right for your
environment. Also see the remarks about @env{AWKPATH} and @code{envsep} in
@ref{Atari Using}.

As shipped, the sample @file{config.h} claims that the @code{system}
function is missing from the libraries, which is not true, and an
alternative implementation of this function is provided in
@file{unsupported/atari/system.c}.
Depending upon your particular combination of
shell and operating system, you might want to change the file to indicate
that @code{system} is available.

@node Atari Using
@appendixsubsubsec Running @command{gawk} on the Atari ST

An executable version of @command{gawk} should be placed, as usual,
anywhere in your @env{PATH} where your shell can find it.

While executing, the Atari version of @command{gawk} creates a number of temporary files. When
using @command{gcc} libraries for TOS, @command{gawk} looks for either of
the environment variables, @env{TEMP} or @env{TMPDIR}, in that order.
If either one is found, its value is assumed to be a directory for
temporary files. This directory must exist, and if you can spare the
memory, it is a good idea to put it on a RAM drive. If neither
@env{TEMP} nor @env{TMPDIR} is found, then @command{gawk} uses the
current directory for its temporary files.

The ST version of @command{gawk} searches for its program files, as described in
@ref{AWKPATH Variable}.
The default value for the @env{AWKPATH} variable is taken from
@code{DEFPATH} defined in @file{Makefile}. The sample @command{gcc}/TOS
@file{Makefile} for the ST in the distribution sets @code{DEFPATH} to
@code{@w{".,c:\lib\awk,c:\gnu\lib\awk"}}. The search path can be
modified by explicitly setting @env{AWKPATH} to whatever you want.
Note that colons cannot be used on the ST to separate elements in the
@env{AWKPATH} variable, since they have another reserved meaning.
Instead, you must use a comma to separate elements in the path. When
recompiling, the separating character can be modified by initializing
the @code{envsep} variable in @file{unsupported/atari/gawkmisc.atr} to another
value.

Although @command{awk} allows great flexibility in doing I/O redirections
from within a program, this facility should be used with care on the ST
running under TOS. In some circumstances, the OS routines for file-handle
pool processing lose track of certain events, causing the
computer to crash and requiring a reboot. Often a warm reboot is
sufficient. Fortunately, this happens infrequently and in rather
esoteric situations. In particular, avoid having one part of an
@command{awk} program using @code{print} statements explicitly redirected
to @file{/dev/stdout}, while other @code{print} statements use the
default standard output, and a calling shell has redirected standard
output to a file.
@c 10/2000: Is this still true, now that gawk does /dev/stdout internally?

When @command{gawk} is compiled with the ST version of @command{gcc} and its
usual libraries, it accepts both @samp{/} and @samp{\} as path separators.
While this is convenient, it should be remembered that this removes one
technically valid character (@samp{/}) from your @value{FN}.
It may also create problems for external programs called via the @code{system}
function, which may not support this convention. Whenever it is possible
that a file created by @command{gawk} will be used by some other program,
use only backslashes. Also remember that in @command{awk}, backslashes in
strings have to be doubled in order to get literal backslashes
(@pxref{Escape Sequences}).

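
As a small illustration of the last point, the following command prints a
backslash-separated path from an @command{awk} string. The path is only an
example, borrowed from the default search path mentioned earlier, and the
command is shown with Unix shell quoting, which is an assumption and may
need adjusting for an ST shell.

```shell
# Each literal backslash inside an awk string constant is written twice.
path=$(awk 'BEGIN { print "c:\\lib\\awk" }')
printf '%s\n' "$path"    # prints c:\lib\awk
```
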
@node Tandem Installation
@appendixsubsec Installing @command{gawk} on a Tandem
@cindex tandem
@cindex installation, tandem

The Tandem port is only minimally supported.
The port's contributor no longer has access to a Tandem system.

@c This section based on README.Tandem by Stephen Davies (scldad@sdc.com.au)
The Tandem port was done on a Cyclone machine running D20.
The port is pretty clean and all facilities seem to work except for
the I/O piping facilities
(@pxref{Getline/Pipe},
@ref{Getline/Variable/Pipe},
and
@ref{Redirection}),
which is just too foreign a concept for Tandem.

To build a Tandem executable from source, download all of the files so
that the @value{FN}s on the Tandem box conform to the restrictions of D20.
For example, @file{array.c} becomes @file{ARRAYC}, and @file{awk.h}
becomes @file{AWKH}. The totally Tandem-specific files are in the
@file{tandem} ``subvolume'' (@file{unsupported/tandem} in the @command{gawk}
distribution) and should be copied to the main source directory before
building @command{gawk}.

The file @file{compit} can then be used to compile and bind an executable.
Alas, there is no @command{configure} or @command{make}.

Usage is the same as for Unix, except that D20 requires all @samp{@{} and
@samp{@}} characters to be escaped with @samp{~} on the command line
(but @emph{not} in script files). Also, the standard Tandem syntax for
@samp{/in filename,out filename/} must be used instead of the usual
Unix @samp{<} and @samp{>} for file redirection. (Redirection options
on @code{getline}, @code{print} etc., are supported.)

The @samp{-mr @var{val}} option
(@pxref{Options})
has been ``stolen'' to enable Tandem users to process fixed-length
records with no ``end-of-line'' character. That is, @samp{-mr 74} tells
@command{gawk} to read the input file as fixed 74-byte records.
@c ENDOFRANGE opgawx
@c ENDOFRANGE pcgawon

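
On systems without this Tandem-specific option, the effect of reading
fixed-length records can be approximated in portable @command{awk} by
slicing the input with @code{substr}. The sketch below is an illustration,
not the Tandem implementation: the function name @code{split_fixed} is
hypothetical, it counts characters rather than bytes, and it assumes that
the input contains no newline characters.

```shell
# Hypothetical helper: split standard input into records of $1 characters
# each, printing one record per line. Assumes the input has no newlines.
split_fixed() {
    awk -v n="$1" '{
        for (i = 1; i <= length($0); i += n)
            print substr($0, i, n)
    }'
}

printf 'AAABBBCC' | split_fixed 3
# prints AAA, BBB, and CC on separate lines
```

With a record length of 74, the same function would mimic the
@samp{-mr 74} example above.
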
@node Bugs
@appendixsec Reporting Problems and Bugs
@cindex archeologists
@quotation
@i{There is nothing more dangerous than a bored archeologist.}@*
The Hitchhiker's Guide to the Galaxy
@end quotation
@c the radio show, not the book. :-)

@c STARTOFRANGE dbugg
@cindex debugging @command{gawk}, bug reports
@c STARTOFRANGE tblgawb
@cindex troubleshooting, @command{gawk}, bug reports
If you have problems with @command{gawk} or think that you have found a bug,
please report it to the developers; we cannot promise to do anything,
but we might well want to fix it.

Before reporting a bug, make sure you have actually found a real bug.
Carefully reread the documentation and see if it really says you can do
what you're trying to do. If it's not clear whether you should be able
to do something or not, report that too; it's a bug in the documentation!

Before reporting a bug or trying to fix it yourself, try to isolate it
to the smallest possible @command{awk} program and input @value{DF} that
reproduce the problem. Then send us the program and @value{DF},
some idea of what kind of Unix system you're using,
the compiler you used to compile @command{gawk}, and the exact results
@command{gawk} gave you. Also say what you expected to occur; this helps
us decide whether the problem is really in the documentation.

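
The environment details requested above can be collected with a few
commands. The following is a sketch only: the @code{AWK} and @code{CC}
variables and the function name are assumptions made for illustration, and
the test program, the input @value{DF}, and a description of the expected
and actual results still have to be added by hand.

```shell
# Sketch: print the environment details a gawk bug report should mention.
# AWK and CC are assumed to name the gawk binary and the compiler used.
AWK=${AWK:-gawk}

bug_report_info() {
    echo "gawk version: $("$AWK" --version 2>/dev/null | head -n 1)"
    echo "system: $(uname -a)"
    echo "compiler: ${CC:-unknown}"
}

bug_report_info
```
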
@cindex @code{bug-gawk@@gnu.org} bug reporting address
@cindex email address for bug reports, @code{bug-gawk@@gnu.org}
@cindex bug reports, email address, @code{bug-gawk@@gnu.org}
Once you have a precise problem, send email to @email{bug-gawk@@gnu.org}.

@cindex Robbins, Arnold
Please include the version number of @command{gawk} you are using.
You can get this information with the command @samp{gawk --version}.
Using this address automatically sends a carbon copy of your
mail to me. If necessary, I can be reached directly at
@email{arnold@@gnu.org}. The bug reporting address is preferred since the
email list is archived at the GNU Project.
@emph{All email should be in English, since that is my native language.}

@cindex @code{comp.lang.awk} newsgroup
@strong{Caution:} Do @emph{not} try to report bugs in @command{gawk} by
posting to the Usenet/Internet newsgroup @code{comp.lang.awk}.
While the @command{gawk} developers do occasionally read this newsgroup,
there is no guarantee that we will see your posting. The steps described
above are the officially recognized ways for reporting bugs.

Non-bug suggestions are always welcome as well. If you have questions
about things that are unclear in the documentation or are just obscure
features, ask me; I will try to help you out, although I
may not have the time to fix the problem. You can send me electronic
mail at the Internet address noted previously.

If you find bugs in one of the non-Unix ports of @command{gawk}, please send
an electronic mail message to the person who maintains that port. They
are named in the following list, as well as in the @file{README} file in the @command{gawk}
distribution. Information in the @file{README} file should be considered
authoritative if it conflicts with this @value{DOCUMENT}.

The people maintaining the non-Unix ports of @command{gawk} are
as follows:

@ignore
@table @asis
@cindex Fish, Fred
@item Amiga
Fred Fish, @email{fnf@@ninemoons.com}.

@cindex Brown, Martin
@item BeOS
Martin Brown, @email{mc@@whoever.com}.

@cindex Deifik, Scott
@cindex Hankerson, Darrel
@item MS-DOS
Scott Deifik, @email{scottd@@amgen.com} and
Darrel Hankerson, @email{hankedr@@mail.auburn.edu}.

@cindex Grigera, Juan
@item MS-Windows
Juan Grigera, @email{juan@@biophnet.unlp.edu.ar}.

@item OS/2
The Unix for OS/2 team, @email{gawk-maintainer@@unixos2.org}.

@cindex Davies, Stephen
@item Tandem
Stephen Davies, @email{scldad@@sdc.com.au}.

@cindex Rankin, Pat
@item VMS
Pat Rankin, @email{rankin@@pactechdata.com}.
@end table
@end ignore

@multitable {MS-Windows} {123456789012345678901234567890123456789001234567890}
@cindex Fish, Fred
@item Amiga @tab Fred Fish, @email{fnf@@ninemoons.com}.

@cindex Brown, Martin
@item BeOS @tab Martin Brown, @email{mc@@whoever.com}.

@cindex Deifik, Scott
@cindex Hankerson, Darrel
@item MS-DOS @tab Scott Deifik, @email{scottd@@amgen.com} and
Darrel Hankerson, @email{hankedr@@mail.auburn.edu}.

@cindex Grigera, Juan
@item MS-Windows @tab Juan Grigera, @email{juan@@biophnet.unlp.edu.ar}.

@item OS/2 @tab The Unix for OS/2 team, @email{gawk-maintainer@@unixos2.org}.

@cindex Davies, Stephen
@item Tandem @tab Stephen Davies, @email{scldad@@sdc.com.au}.

@cindex Rankin, Pat
@item VMS @tab Pat Rankin, @email{rankin@@pactechdata.com}.
@end multitable

If your bug is also reproducible under Unix, please send a copy of your
report to the @email{bug-gawk@@gnu.org} email list as well.
@c ENDOFRANGE dbugg
@c ENDOFRANGE tblgawb

@node Other Versions
@appendixsec Other Freely Available @command{awk} Implementations
@c STARTOFRANGE awkim
@cindex @command{awk}, implementations
@ignore
From: emory!amc.com!brennan (Michael Brennan)
Subject: C++ comments in awk programs
To: arnold@gnu.ai.mit.edu (Arnold Robbins)
Date: Wed, 4 Sep 1996 08:11:48 -0700 (PDT)

@end ignore
@cindex Brennan, Michael
@quotation
@i{It's kind of fun to put comments like this in your awk code.}@*
@ @ @ @ @ @ @code{// Do C++ comments work? answer: yes! of course}@*
Michael Brennan
@end quotation

There are several other freely available @command{awk} implementations.
This @value{SECTION} briefly describes where to get them:

@table @asis
@cindex Kernighan, Brian
@cindex source code, Bell Laboratories @command{awk}
@item Unix @command{awk}
Brian Kernighan has made his implementation of
@command{awk} freely available.
You can retrieve this version via the World Wide Web from
his home page.@footnote{@uref{http://cm.bell-labs.com/who/bwk}}
It is available in several archive formats:

@table @asis
@item Shell archive
@uref{http://cm.bell-labs.com/who/bwk/awk.shar}

@item Compressed @command{tar} file
@uref{http://cm.bell-labs.com/who/bwk/awk.tar.gz}

@item Zip file
@uref{http://cm.bell-labs.com/who/bwk/awk.zip}
@end table

This version requires an ISO C (1990 standard) compiler;
the C compiler from
GCC (the GNU Compiler Collection)
works quite nicely.

@xref{BTL},
for a list of extensions in this @command{awk} that are not in POSIX @command{awk}.

@cindex Brennan, Michael
@cindex @command{mawk} program
@cindex source code, @command{mawk}
@item @command{mawk}
Michael Brennan has written an independent implementation of @command{awk},
called @command{mawk}. It is available under the GPL
(@pxref{Copying}),
just as @command{gawk} is.

You can get it via anonymous @command{ftp} to the host
@code{@w{ftp.whidbey.net}}. Change directory to @file{/pub/brennan}.
Use ``binary'' or ``image'' mode, and retrieve @file{mawk1.3.3.tar.gz}
(or the latest version that is there).

@command{gunzip} may be used to decompress this file. Installation
is similar to @command{gawk}'s
(@pxref{Unix Installation}).

@cindex extensions, @command{mawk}
@command{mawk} has the following extensions that are not in POSIX @command{awk}:

@itemize @bullet
@item
The @code{fflush} built-in function for flushing buffered output
(@pxref{I/O Functions}).

@item
The @samp{**} and @samp{**=} operators
(@pxref{Arithmetic Ops}
and also see
@ref{Assignment Ops}).

@item
The use of @code{func} as an abbreviation for @code{function}
(@pxref{Definition Syntax}).

@item
The @samp{\x} escape sequence
(@pxref{Escape Sequences}).

@item
The @file{/dev/stdout} and @file{/dev/stderr}
special files
(@pxref{Special Files}).
Use @code{"-"} instead of @code{"/dev/stdin"} with @command{mawk}.

@item
The ability for @code{FS} and for the third
argument to @code{split} to be null strings
(@pxref{Single Character Fields}).

@item
The ability to delete all of an array at once with @samp{delete @var{array}}
(@pxref{Delete}).

@item
The ability for @code{RS} to be a regexp
(@pxref{Records}).

@item
The @code{BINMODE} special variable for non-Unix operating systems
(@pxref{PC Using}).
@end itemize

The next version of @command{mawk} will support @code{nextfile}.

@cindex Sumner, Andrew
@cindex @command{awka} compiler for @command{awk}
@cindex source code, @command{awka}
@item @command{awka}
Written by Andrew Sumner,
@command{awka} translates @command{awk} programs into C, compiles them,
and links them with a library of functions that provides the core
@command{awk} functionality.
It also has a number of extensions.

The @command{awk} translator is released under the GPL, and the library
is under the LGPL.

To get @command{awka}, go to @uref{http://awka.sourceforge.net}.
You can reach Andrew Sumner at @email{andrew@@zbcom.net}.

@cindex Beebe, Nelson H.F.
@cindex @command{pawk} profiling Bell Labs @command{awk}
@item @command{pawk}
Nelson H.F.@: Beebe at the University of Utah has modified
the Bell Labs @command{awk} to provide timing and profiling information.
It is different from @command{pgawk}
(@pxref{Profiling}),
in that it uses CPU-based profiling, not line-count
profiling. You may find it at either
@uref{ftp://ftp.math.utah.edu/pub/pawk/pawk-20020210.tar.gz}
or
@uref{http://www.math.utah.edu/pub/pawk/pawk-20020210.tar.gz}.

@end table
@c ENDOFRANGE gligawk
@c ENDOFRANGE ingawk
@c ENDOFRANGE awkim

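
Programs that must also run on an @command{awk} lacking the extensions
listed for @command{mawk} above can often use a portable spelling instead.
The following sketch shows two such substitutions: @samp{^} in place of
@samp{**}, and @samp{split("", a)} in place of @samp{delete a} for
clearing a whole array. The variable names are arbitrary.

```shell
# Portable stand-ins for two of the extensions listed above.
result=$(awk 'BEGIN {
    a["x"] = 1; a["y"] = 2
    split("", a)              # clears the array, like: delete a
    n = 0
    for (k in a) n++
    print n, 2 ^ 3            # ^ instead of the ** extension
}')
printf '%s\n' "$result"       # prints 0 8
```
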
@node Notes
@appendix Implementation Notes
@c STARTOFRANGE gawii
@cindex @command{gawk}, implementation issues
@c STARTOFRANGE impis
@cindex implementation issues, @command{gawk}

This appendix contains information mainly of interest to implementors and
maintainers of @command{gawk}. Everything in it applies specifically to
@command{gawk} and not to other implementations.

@menu
* Compatibility Mode:: How to disable certain @command{gawk}
                       extensions.
* Additions:: Making Additions To @command{gawk}.
* Dynamic Extensions:: Adding new built-in functions to
                       @command{gawk}.
* Future Extensions:: New features that may be implemented one day.
@end menu

@node Compatibility Mode
@appendixsec Downward Compatibility and Debugging
@cindex @command{gawk}, implementation issues, downward compatibility
@cindex @command{gawk}, implementation issues, debugging
@cindex troubleshooting, @command{gawk}
@c first comma is part of primary
@cindex implementation issues, @command{gawk}, debugging

@xref{POSIX/GNU},
for a summary of the GNU extensions to the @command{awk} language and program.
All of these features can be turned off by invoking @command{gawk} with the
@option{--traditional} option or with the @option{--posix} option.

If @command{gawk} is compiled for debugging with @samp{-DDEBUG}, then there
is one more option available on the command line:

@table @code
@item -W parsedebug
@itemx --parsedebug
Prints out the parse stack information as the program is being parsed.
@end table

This option is intended only for serious @command{gawk} developers
and not for the casual user. It probably has not even been compiled into
your version of @command{gawk}, since it slows down execution.

@node Additions
@appendixsec Making Additions to @command{gawk}

If you find that you want to enhance @command{gawk} in a significant
fashion, you are perfectly free to do so. That is the point of having
free software; the source code is available and you are free to change
it as you want (@pxref{Copying}).

This @value{SECTION} discusses the ways you might want to change @command{gawk}
as well as any considerations you should bear in mind.

@menu
* Adding Code:: Adding code to the main body of
                @command{gawk}.
* New Ports:: Porting @command{gawk} to a new operating
              system.
@end menu

@node Adding Code
|
|
@appendixsubsec Adding New Features
|
|
|
|
@c STARTOFRANGE adfgaw
|
|
@cindex adding, features to @command{gawk}
|
|
@c STARTOFRANGE fadgaw
|
|
@cindex features, adding to @command{gawk}
|
|
@c STARTOFRANGE gawadf
|
|
@cindex @command{gawk}, features, adding
|
|
You are free to add any new features you like to @command{gawk}.
|
|
However, if you want your changes to be incorporated into the @command{gawk}
|
|
distribution, there are several steps that you need to take in order to
|
|
make it possible for me to include your changes:
|
|
|
|
@enumerate 1
|
|
@item
|
|
Before building the new feature into @command{gawk} itself,
|
|
consider writing it as an extension module
|
|
(@pxref{Dynamic Extensions}).
|
|
If that's not possible, continue with the rest of the steps in this list.
|
|
|
|
@item
|
|
Get the latest version.
|
|
It is much easier for me to integrate changes if they are relative to
|
|
the most recent distributed version of @command{gawk}. If your version of
|
|
@command{gawk} is very old, I may not be able to integrate them at all.
|
|
(@xref{Getting},
|
|
for information on getting the latest version of @command{gawk}.)
|
|
|
|
@item
|
|
@ifnotinfo
|
|
Follow the @cite{GNU Coding Standards}.
|
|
@end ifnotinfo
|
|
@ifinfo
|
|
See @inforef{Top, , Version, standards, GNU Coding Standards}.
|
|
@end ifinfo
|
|
This document describes how GNU software should be written. If you haven't
|
|
read it, please do so, preferably @emph{before} starting to modify @command{gawk}.
|
|
(The @cite{GNU Coding Standards} are available from
|
|
the GNU Project's
|
|
@command{ftp}
|
|
site, at
|
|
@uref{ftp://ftp.gnu.org/gnu/GNUinfo/standards.text}.
|
|
An HTML version, suitable for reading with a WWW browser, is
|
|
available at
|
|
@uref{http://www.gnu.org/prep/standards_toc.html}.
|
|
Texinfo, Info, and DVI versions are also available.)
|
|
|
|
@cindex @command{gawk}, coding style in
|
|
@item
|
|
Use the @command{gawk} coding style.
|
|
The C code for @command{gawk} follows the instructions in the
|
|
@cite{GNU Coding Standards}, with minor exceptions. The code is formatted
|
|
using the traditional ``K&R'' style, particularly as regards the placement
|
|
of braces and the use of tabs. In brief, the coding rules for @command{gawk}
|
|
are as follows:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
Use ANSI/ISO style (prototype) function headers when defining functions.
|
|
|
|
@item
|
|
Put the name of the function at the beginning of its own line.
|
|
|
|
@item
|
|
Put the return type of the function, even if it is @code{int}, on the
|
|
line above the line with the name and arguments of the function.
|
|
|
|
@item
|
|
Put spaces around parentheses used in control structures
|
|
(@code{if}, @code{while}, @code{for}, @code{do}, @code{switch},
|
|
and @code{return}).
|
|
|
|
@item
|
|
Do not put spaces in front of parentheses used in function calls.
|
|
|
|
@item
|
|
Put spaces around all C operators and after commas in function calls.
|
|
|
|
@item
|
|
Do not use the comma operator to produce multiple side effects, except
|
|
in @code{for} loop initialization and increment parts, and in macro bodies.
|
|
|
|
@item
|
|
Use real tabs for indenting, not spaces.
|
|
|
|
@item
|
|
Use the ``K&R'' brace layout style.
|
|
|
|
@item
|
|
Use comparisons against @code{NULL} and @code{'\0'} in the conditions of
|
|
@code{if}, @code{while}, and @code{for} statements, as well as in the @code{case}s
|
|
of @code{switch} statements, instead of just the
|
|
plain pointer or character value.
|
|
|
|
@item
|
|
Use the @code{TRUE}, @code{FALSE} and @code{NULL} symbolic constants
|
|
and the character constant @code{'\0'} where appropriate, instead of @code{1}
|
|
and @code{0}.
|
|
|
|
@item
|
|
Use the @code{ISALPHA}, @code{ISDIGIT}, etc.@: macros, instead of the
|
|
traditional lowercase versions; these macros are better behaved for
|
|
non-ASCII character sets.
|
|
|
|
@item
|
|
Provide one-line descriptive comments for each function.
|
|
|
|
@item
|
|
Do not use @samp{#elif}. Many older Unix C compilers cannot handle it.
|
|
|
|
@item
|
|
Do not use the @code{alloca} function for allocating memory off the stack.
|
|
Its use causes more portability trouble than is worth the minor benefit of not having
|
|
to free the storage. Instead, use @code{malloc} and @code{free}.
|
|
@end itemize
|
|
|
|
@strong{Note:}
|
|
If I have to reformat your code to follow the coding style used in
|
|
@command{gawk}, I may not bother to integrate your changes at all.
|
|
|
|
@item
|
|
Be prepared to sign the appropriate paperwork.
|
|
In order for the FSF to distribute your changes, you must either place
|
|
those changes in the public domain and submit a signed statement to that
|
|
effect, or assign the copyright in your changes to the FSF.
|
|
Both of these actions are easy to do and @emph{many} people have done so
|
|
already. If you have questions, please contact me
|
|
(@pxref{Bugs}),
|
|
or @email{gnu@@gnu.org}.
|
|
|
|
@cindex Texinfo
|
|
@item
|
|
Update the documentation.
|
|
Along with your new code, please supply new sections and/or chapters
|
|
for this @value{DOCUMENT}. If at all possible, please use real
|
|
Texinfo, instead of just supplying unformatted ASCII text (although
|
|
even that is better than no documentation at all).
|
|
Conventions to be followed in @cite{@value{TITLE}} are provided
|
|
after the @samp{@@bye} at the end of the Texinfo source file.
|
|
If possible, please update the @command{man} page as well.
|
|
|
|
You will also have to sign paperwork for your documentation changes.
|
|
|
|
@item
|
|
Submit changes as context diffs or unified diffs.
|
|
Use @samp{diff -c -r -N} or @samp{diff -u -r -N} to compare
|
|
the original @command{gawk} source tree with your version.
|
|
(I find context diffs to be more readable but unified diffs are
|
|
more compact.)
|
|
I recommend using the GNU version of @command{diff}.
|
|
Send the output produced by either run of @command{diff} to me when you
|
|
submit your changes.
|
|
(@xref{Bugs}, for the electronic mail
|
|
information.)
|
|
|
|
Using this format makes it easy for me to apply your changes to the
|
|
master version of the @command{gawk} source code (using @code{patch}).
|
|
If I have to apply the changes manually, using a text editor, I may
|
|
not do so, particularly if there are lots of changes.
|
|
|
|
@item
|
|
Include an entry for the @file{ChangeLog} file with your submission.
|
|
This helps further minimize the amount of work I have to do,
|
|
making it easier for me to accept patches.
|
|
@end enumerate
|
|
|
|
Although this sounds like a lot of work, please remember that while you
|
|
may write the new code, I have to maintain it and support it. If it
|
|
isn't possible for me to do that with a minimum of extra work, then I
|
|
probably will not.
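
As an illustration of the coding rules listed earlier in this @value{SECTION}, here is a small function laid out in that style. It is only a sketch: @code{count_blanks} is a hypothetical helper invented for this example and is not part of @command{gawk}. It shows the return type on its own line, the function name at the start of the next line, real tabs for indentation, explicit comparisons against @code{NULL} and @code{'\0'}, and a one-line descriptive comment:

```c
#include <stddef.h>

/* count_blanks --- count the space and tab characters in a string */

size_t
count_blanks(const char *s)
{
	size_t n = 0;
	const char *p;

	/* compare against NULL explicitly instead of testing the pointer */
	if (s == NULL)
		return 0;

	/* compare against '\0' explicitly instead of testing the character */
	for (p = s; *p != '\0'; p++) {
		if (*p == ' ' || *p == '\t')
			n++;
	}

	return n;
}
```

Note the space between each control keyword and its parenthesis, and the absence of a space between a function name and its argument list.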
|
|
@c ENDOFRANGE adfgaw
|
|
@c ENDOFRANGE gawadf
|
|
@c ENDOFRANGE fadgaw
|
|
|
|
@node New Ports
|
|
@appendixsubsec Porting @command{gawk} to a New Operating System
|
|
@cindex portability, @command{gawk}
|
|
@cindex operating systems, porting @command{gawk} to
|
|
|
|
@cindex porting @command{gawk}
|
|
If you want to port @command{gawk} to a new operating system, there are
|
|
several steps:
|
|
|
|
@enumerate 1
|
|
@item
|
|
Follow the guidelines in
|
|
@ifinfo
|
|
@ref{Adding Code},
|
|
@end ifinfo
|
|
@ifnotinfo
|
|
the previous @value{SECTION}
|
|
@end ifnotinfo
|
|
concerning coding style, submission of diffs, and so on.
|
|
|
|
@item
|
|
When doing a port, bear in mind that your code must coexist peacefully
|
|
with the rest of @command{gawk} and the other ports. Avoid gratuitous
|
|
changes to the system-independent parts of the code. If at all possible,
|
|
avoid sprinkling @samp{#ifdef}s just for your port throughout the
|
|
code.
|
|
|
|
If the changes needed for a particular system affect too much of the
|
|
code, I probably will not accept them. In such a case, you can, of course,
|
|
distribute your changes on your own, as long as you comply
|
|
with the GPL
|
|
(@pxref{Copying}).
|
|
|
|
@item
|
|
A number of the files that come with @command{gawk} are maintained by other
|
|
people at the Free Software Foundation. Thus, you should not change them
|
|
unless it is for a very good reason; i.e., changes are not out of the
|
|
question, but changes to these files are scrutinized extra carefully.
|
|
The files are @file{getopt.h}, @file{getopt.c},
|
|
@file{getopt1.c}, @file{regex.h}, @file{regex.c}, @file{dfa.h},
|
|
@file{dfa.c}, @file{install-sh}, and @file{mkinstalldirs}.
|
|
|
|
@item
|
|
Be willing to continue to maintain the port.
|
|
Non-Unix operating systems are supported by volunteers who maintain
|
|
the code needed to compile and run @command{gawk} on their systems. If no one
|
|
volunteers to maintain a port, it becomes unsupported and it may
|
|
be necessary to remove it from the distribution.
|
|
|
|
@item
|
|
Supply an appropriate @file{gawkmisc.???} file.
|
|
Each port has its own @file{gawkmisc.???} that implements certain
|
|
operating system specific functions. This is cleaner than a plethora of
|
|
@samp{#ifdef}s scattered throughout the code. The @file{gawkmisc.c} in
|
|
the main source directory includes the appropriate
|
|
@file{gawkmisc.???} file from each subdirectory.
|
|
Be sure to update it as well.
|
|
|
|
Each port's @file{gawkmisc.???} file has a suffix reminiscent of the machine
|
|
or operating system for the port---for example, @file{pc/gawkmisc.pc} and
|
|
@file{vms/gawkmisc.vms}. The use of separate suffixes, instead of plain
|
|
@file{gawkmisc.c}, makes it possible to move files from a port's subdirectory
|
|
into the main subdirectory, without accidentally destroying the real
|
|
@file{gawkmisc.c} file. (Currently, this is only an issue for the
|
|
PC operating system ports.)
|
|
|
|
@item
|
|
Supply a @file{Makefile} as well as any other C source and header files that are
|
|
necessary for your operating system. All your code should be in a
|
|
separate subdirectory, with a name that is the same as, or reminiscent
|
|
of, either your operating system or the computer system. If possible,
|
|
try to structure things so that it is not necessary to move files out
|
|
of the subdirectory into the main source directory. If that is not
|
|
possible, then be sure to avoid using names for your files that
|
|
duplicate the names of files in the main source directory.
|
|
|
|
@item
|
|
Update the documentation.
|
|
Please write a section (or sections) for this @value{DOCUMENT} describing the
|
|
installation and compilation steps needed to compile and/or install
|
|
@command{gawk} for your system.
|
|
|
|
@item
|
|
Be prepared to sign the appropriate paperwork.
|
|
In order for the FSF to distribute your code, you must either place
|
|
your code in the public domain and submit a signed statement to that
|
|
effect, or assign the copyright in your code to the FSF.
|
|
@ifinfo
|
|
Both of these actions are easy to do and @emph{many} people have done so
|
|
already. If you have questions, please contact me, or
|
|
@email{gnu@@gnu.org}.
|
|
@end ifinfo
|
|
@end enumerate
|
|
|
|
Following these steps makes it much easier to integrate your changes
|
|
into @command{gawk} and have them coexist happily with other
|
|
operating systems' code that is already there.
|
|
|
|
In the code that you supply and maintain, feel free to use a
|
|
coding style and brace layout that suits your taste.
|
|
|
|
@node Dynamic Extensions
|
|
@appendixsec Adding New Built-in Functions to @command{gawk}
|
|
@cindex Robinson, Will
|
|
@cindex robot, the
|
|
@cindex Lost In Space
|
|
@quotation
|
|
@i{Danger Will Robinson! Danger!!@*
|
|
Warning! Warning!}@*
|
|
The Robot
|
|
@end quotation
|
|
|
|
@c STARTOFRANGE gladfgaw
|
|
@cindex @command{gawk}, functions, adding
|
|
@c STARTOFRANGE adfugaw
|
|
@cindex adding, functions to @command{gawk}
|
|
@c STARTOFRANGE fubadgaw
|
|
@cindex functions, built-in, adding to @command{gawk}
|
|
Beginning with @command{gawk} 3.1, it is possible to add new built-in
|
|
functions to @command{gawk} using dynamically loaded libraries. This
|
|
facility is available on systems (such as GNU/Linux) that support
|
|
the @code{dlopen} and @code{dlsym} functions.
|
|
This @value{SECTION} describes how to write and use dynamically
|
|
loaded extensions for @command{gawk}.
|
|
Experience with programming in
|
|
C or C++ is necessary when reading this @value{SECTION}.
|
|
|
|
@strong{Caution:} The facilities described in this @value{SECTION}
|
|
are very much subject to change in the next @command{gawk} release.
|
|
Be aware that you may have to re-do everything, perhaps from scratch,
|
|
upon the next release.
|
|
|
|
@menu
|
|
* Internals:: A brief look at some @command{gawk} internals.
|
|
* Sample Library:: An example of new functions.
|
|
@end menu
|
|
|
|
@node Internals
|
|
@appendixsubsec A Minimal Introduction to @command{gawk} Internals
|
|
@c STARTOFRANGE gawint
|
|
@cindex @command{gawk}, internals
|
|
|
|
The truth is that @command{gawk} was not designed for simple extensibility.
|
|
The facilities for adding functions using shared libraries work, but
|
|
are something of a ``bag on the side.'' Thus, this tour is
|
|
brief and simplistic; would-be @command{gawk} hackers are encouraged to
|
|
spend some time reading the source code before trying to write
|
|
extensions based on the material presented here. Of particular note
|
|
are the files @file{awk.h}, @file{builtin.c}, and @file{eval.c}.
|
|
Reading @file{awk.y} in order to see how the parse tree is built
|
|
would also be of use.
|
|
|
|
@cindex @code{awk.h} file (internal)
|
|
With the disclaimers out of the way, the following types, structure
|
|
members, functions, and macros are declared in @file{awk.h} and are of
|
|
use when writing extensions. The next @value{SECTION}
|
|
shows how they are used:
|
|
|
|
@table @code
|
|
@cindex floating-point, numbers, @code{AWKNUM} internal type
|
|
@cindex numbers, floating-point, @code{AWKNUM} internal type
|
|
@cindex @code{AWKNUM} internal type
|
|
@item AWKNUM
|
|
An @code{AWKNUM} is the internal type of @command{awk}
|
|
floating-point numbers. Typically, it is a C @code{double}.
|
|
|
|
@cindex @code{NODE} internal type
|
|
@cindex strings, @code{NODE} internal type
|
|
@cindex numbers, @code{NODE} internal type
|
|
@item NODE
|
|
Just about everything is done using objects of type @code{NODE}.
|
|
These contain both strings and numbers, as well as variables and arrays.
|
|
|
|
@cindex @code{force_number} internal function
|
|
@cindex numeric, values
|
|
@item AWKNUM force_number(NODE *n)
|
|
This macro forces a value to be numeric. It returns the actual
|
|
numeric value contained in the node.
|
|
It may end up calling an internal @command{gawk} function.
|
|
|
|
@cindex @code{force_string} internal function
|
|
@item void force_string(NODE *n)
|
|
This macro guarantees that a @code{NODE}'s string value is current.
|
|
It may end up calling an internal @command{gawk} function.
|
|
It also guarantees that the string is zero-terminated.
|
|
|
|
@c comma is part of primary
|
|
@cindex parameters, number of
|
|
@cindex @code{param_cnt} internal variable
|
|
@item n->param_cnt
|
|
The number of parameters actually passed in a function call at runtime.
|
|
|
|
@cindex @code{stptr} internal variable
|
|
@cindex @code{stlen} internal variable
|
|
@item n->stptr
|
|
@itemx n->stlen
|
|
The data and length of a @code{NODE}'s string value, respectively.
|
|
The string is @emph{not} guaranteed to be zero-terminated.
|
|
If you need to pass the string value to a C library function, save
|
|
the value in @code{n->stptr[n->stlen]}, assign @code{'\0'} to it,
|
|
call the routine, and then restore the value.
|
|
|
|
@cindex @code{type} internal variable
|
|
@item n->type
|
|
The type of the @code{NODE}. This is a C @code{enum}. Values should
|
|
be either @code{Node_var} or @code{Node_var_array} for function
|
|
parameters.
|
|
|
|
@cindex @code{vname} internal variable
|
|
@item n->vname
|
|
The ``variable name'' of a node. This is not of much use inside
|
|
externally written extensions.
|
|
|
|
@cindex arrays, associative, clearing
|
|
@cindex @code{assoc_clear} internal function
|
|
@item void assoc_clear(NODE *n)
|
|
Clears the associative array pointed to by @code{n}.
|
|
Make sure that @samp{n->type == Node_var_array} first.
|
|
|
|
@cindex arrays, elements, installing
|
|
@cindex @code{assoc_lookup} internal function
|
|
@item NODE **assoc_lookup(NODE *symbol, NODE *subs, int reference)
|
|
Finds, and installs if necessary, array elements.
|
|
@code{symbol} is the array, @code{subs} is the subscript.
|
|
This is usually a value created with @code{tmp_string} (see below).
|
|
@code{reference} should be @code{TRUE} if it is an error to use the
|
|
value before it is created. Typically, @code{FALSE} is the
|
|
correct value to use from extension functions.
|
|
|
|
@cindex strings
|
|
@cindex @code{make_string} internal function
|
|
@item NODE *make_string(char *s, size_t len)
|
|
Take a C string and turn it into a pointer to a @code{NODE} that
|
|
can be stored appropriately. This is permanent storage; understanding
|
|
of @command{gawk} memory management is helpful.
|
|
|
|
@cindex numbers
|
|
@cindex @code{make_number} internal function
|
|
@item NODE *make_number(AWKNUM val)
|
|
Take an @code{AWKNUM} and turn it into a pointer to a @code{NODE} that
|
|
can be stored appropriately. This is permanent storage; understanding
|
|
of @command{gawk} memory management is helpful.
|
|
|
|
@cindex @code{tmp_string} internal function
|
|
@item NODE *tmp_string(char *s, size_t len)
|
|
Take a C string and turn it into a pointer to a @code{NODE} that
|
|
can be stored appropriately. This is temporary storage; understanding
|
|
of @command{gawk} memory management is helpful.
|
|
|
|
@cindex @code{tmp_number} internal function
|
|
@item NODE *tmp_number(AWKNUM val)
|
|
Take an @code{AWKNUM} and turn it into a pointer to a @code{NODE} that
|
|
can be stored appropriately. This is temporary storage;
|
|
understanding of @command{gawk} memory management is helpful.
|
|
|
|
@c comma is part of primary
|
|
@cindex nodes, duplicating
|
|
@cindex @code{dupnode} internal function
|
|
@item NODE *dupnode(NODE *n)
|
|
Duplicate a node. In most cases, this increments an internal
|
|
reference count instead of actually duplicating the entire @code{NODE};
|
|
understanding of @command{gawk} memory management is helpful.
|
|
|
|
@cindex memory, releasing
|
|
@cindex @code{free_temp} internal macro
|
|
@item void free_temp(NODE *n)
|
|
This macro releases the memory associated with a @code{NODE}
|
|
allocated with @code{tmp_string} or @code{tmp_number}.
|
|
Understanding of @command{gawk} memory management is helpful.
|
|
|
|
@cindex @code{make_builtin} internal function
|
|
@item void make_builtin(char *name, NODE *(*func)(NODE *), int count)
|
|
Register a C function pointed to by @code{func} as new built-in
|
|
function @code{name}. @code{name} is a regular C string. @code{count}
|
|
is the maximum number of arguments that the function takes.
|
|
The function should be written in the following manner:
|
|
|
|
@example
|
|
/* do_xxx --- do xxx function for gawk */
|
|
|
|
NODE *
|
|
do_xxx(NODE *tree)
|
|
@{
|
|
@dots{}
|
|
@}
|
|
@end example
|
|
|
|
@cindex arguments, retrieving
|
|
@cindex @code{get_argument} internal function
|
|
@item NODE *get_argument(NODE *tree, int i)
|
|
This function is called from within a C extension function to get
|
|
the @code{i}-th argument from the function call.
|
|
The first argument is argument zero.
|
|
|
|
@c last comma is part of secondary
|
|
@cindex functions, return values, setting
|
|
@cindex @code{set_value} internal function
|
|
@item void set_value(NODE *tree)
|
|
This function is called from within a C extension function to set
|
|
the return value from the extension function. This value is
|
|
what the @command{awk} program sees as the return value from the
|
|
new @command{awk} function.
|
|
|
|
@cindex @code{ERRNO} variable
|
|
@cindex @code{update_ERRNO} internal function
|
|
@item void update_ERRNO(void)
|
|
This function is called from within a C extension function to set
|
|
the value of @command{gawk}'s @code{ERRNO} variable, based on the current
|
|
value of the C @code{errno} variable.
|
|
It is provided as a convenience.
|
|
@end table
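
The save, terminate, and restore idiom described above for @code{n->stptr} and @code{n->stlen} can be sketched in isolation. The following is a minimal illustration, not @command{gawk} code: @code{struct strval} is a hypothetical stand-in for the string part of a @code{NODE}, @code{strval_to_long} is an invented helper, and the sketch assumes that the buffer has at least one readable and writable byte past the end of the string:

```c
#include <stdlib.h>
#include <stddef.h>

/* struct strval --- hypothetical stand-in for the string part of a NODE */

struct strval {
	char *stptr;	/* string data, possibly not zero-terminated */
	size_t stlen;	/* number of valid bytes in stptr */
};

/* strval_to_long --- pass a counted string to strtol() safely */

long
strval_to_long(struct strval *n)
{
	char save;
	long result;

	save = n->stptr[n->stlen];	/* remember the byte past the end */
	n->stptr[n->stlen] = '\0';	/* terminate for the C library */
	result = strtol(n->stptr, NULL, 10);
	n->stptr[n->stlen] = save;	/* put the original byte back */

	return result;
}
```

Restoring the saved byte matters because the storage just past the string may belong to other data that shares the same buffer.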
|
|
|
|
An argument that is supposed to be an array needs to be handled with
|
|
some extra code, in case the array being passed in is actually
|
|
from a function parameter.
|
|
|
|
In versions of @command{gawk} up to and including 3.1.2, the
|
|
following boilerplate code shows how to do this:
|
|
|
|
@smallexample
|
|
NODE *the_arg;
|
|
|
|
the_arg = get_argument(tree, 2); /* assume need 3rd arg, 0-based */
|
|
|
|
/* if a parameter, get it off the stack */
|
|
if (the_arg->type == Node_param_list)
|
|
the_arg = stack_ptr[the_arg->param_cnt];
|
|
|
|
/* parameter referenced an array, get it */
|
|
if (the_arg->type == Node_array_ref)
|
|
the_arg = the_arg->orig_array;
|
|
|
|
/* check type */
|
|
if (the_arg->type != Node_var && the_arg->type != Node_var_array)
|
|
fatal("newfunc: third argument is not an array");
|
|
|
|
/* force it to be an array, if necessary, clear it */
|
|
the_arg->type = Node_var_array;
|
|
assoc_clear(the_arg);
|
|
@end smallexample
|
|
|
|
For versions 3.1.3 and later, the internals changed. In particular,
|
|
the interface was actually @emph{simplified} drastically. The
|
|
following boilerplate code now suffices:
|
|
|
|
@smallexample
|
|
NODE *the_arg;
|
|
|
|
the_arg = get_argument(tree, 2); /* assume need 3rd arg, 0-based */
|
|
|
|
/* force it to be an array: */
|
|
the_arg = get_array(the_arg);
|
|
|
|
/* if necessary, clear it: */
|
|
assoc_clear(the_arg);
|
|
@end smallexample
|
|
|
|
Again, you should spend time studying the @command{gawk} internals;
|
|
don't just blindly copy this code.
|
|
@c ENDOFRANGE gawint
|
|
|
|
@node Sample Library
|
|
@appendixsubsec Directory and File Operation Built-ins
|
|
@c comma is part of primary
|
|
@c STARTOFRANGE chdirg
|
|
@cindex @code{chdir} function, implementing in @command{gawk}
|
|
@c comma is part of primary
|
|
@c STARTOFRANGE statg
|
|
@cindex @code{stat} function, implementing in @command{gawk}
|
|
@c last comma is part of secondary
|
|
@c STARTOFRANGE filre
|
|
@cindex files, information about, retrieving
|
|
@c STARTOFRANGE dirch
|
|
@cindex directories, changing
|
|
|
|
Two useful functions that are not in @command{awk} are @code{chdir}
|
|
(so that an @command{awk} program can change its directory) and
|
|
@code{stat} (so that an @command{awk} program can gather information about
|
|
a file).
|
|
This @value{SECTION} implements these functions for @command{gawk} in an
|
|
external extension library.
|
|
|
|
@menu
|
|
* Internal File Description:: What the new functions will do.
|
|
* Internal File Ops:: The code for internal file operations.
|
|
* Using Internal File Ops:: How to use an external extension.
|
|
@end menu
|
|
|
|
@node Internal File Description
|
|
@appendixsubsubsec Using @code{chdir} and @code{stat}
|
|
|
|
This @value{SECTION} shows how to use the new functions at the @command{awk}
|
|
level once they've been integrated into the running @command{gawk}
|
|
interpreter.
|
|
Using @code{chdir} is very straightforward. It takes one argument,
|
|
the new directory to change to:
|
|
|
|
@example
|
|
@dots{}
|
|
newdir = "/home/arnold/funstuff"
|
|
ret = chdir(newdir)
|
|
if (ret < 0) @{
|
|
printf("could not change to %s: %s\n",
|
|
newdir, ERRNO) > "/dev/stderr"
|
|
exit 1
|
|
@}
|
|
@dots{}
|
|
@end example
|
|
|
|
The return value is negative if the @code{chdir} failed,
|
|
and @code{ERRNO}
|
|
(@pxref{Built-in Variables})
|
|
is set to a string indicating the error.
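
For readers more familiar with C than with @command{awk}, the behavior just described corresponds roughly to the following sketch. It is not the extension's code: @code{try_chdir} is a hypothetical helper, and using @code{strerror} to turn @code{errno} into a message is an assumption about how a string like the one in @code{ERRNO} could be produced:

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* try_chdir --- call chdir() and report any failure as a message */

int
try_chdir(const char *dir, const char **errmsg)
{
	int ret;

	*errmsg = NULL;
	ret = chdir(dir);
	if (ret < 0)
		*errmsg = strerror(errno);	/* text for the failure */

	return ret;	/* negative on failure, zero on success */
}
```

A caller tests the return value first and looks at the message only when the value is negative, just as the @command{awk} example tests @code{ret} before using @code{ERRNO}.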
|
|
|
|
Using @code{stat} is a bit more complicated.
|
|
The C @code{stat} function fills in a structure that has a fair
|
|
amount of information.
|
|
The right way to model this in @command{awk} is to fill in an associative
|
|
array with the appropriate information:
|
|
|
|
@c broke printf for page breaking
|
|
@example
|
|
file = "/home/arnold/.profile"
|
|
fdata[1] = "x" # force `fdata' to be an array
|
|
ret = stat(file, fdata)
|
|
if (ret < 0) @{
|
|
printf("could not stat %s: %s\n",
|
|
file, ERRNO) > "/dev/stderr"
|
|
exit 1
|
|
@}
|
|
printf("size of %s is %d bytes\n", file, fdata["size"])
|
|
@end example
|
|
|
|
The @code{stat} function always clears the data array, even if
|
|
the @code{stat} fails. It fills in the following elements:
|
|
|
|
@table @code
|
|
@item "name"
|
|
The name of the file that was @code{stat}'ed.
|
|
|
|
@item "dev"
|
|
@itemx "ino"
|
|
The file's device and inode numbers, respectively.
|
|
|
|
@item "mode"
|
|
The file's mode, as a numeric value. This includes both the file's
|
|
type and its permissions.
|
|
|
|
@item "nlink"
|
|
The number of hard links (directory entries) the file has.
|
|
|
|
@item "uid"
|
|
@itemx "gid"
|
|
The numeric user and group ID numbers of the file's owner.
|
|
|
|
@item "size"
|
|
The size in bytes of the file.
|
|
|
|
@item "blocks"
|
|
The number of disk blocks the file actually occupies. This may not
|
|
be a function of the file's size if the file has holes.
|
|
|
|
@item "atime"
|
|
@itemx "mtime"
|
|
@itemx "ctime"
|
|
The file's last access, modification, and inode update times,
|
|
respectively. These are numeric timestamps, suitable for formatting
|
|
with @code{strftime}
|
|
(@pxref{Built-in}).
|
|
|
|
@item "pmode"
|
|
The file's ``printable mode.'' This is a string representation of
|
|
the file's type and permissions, such as what is produced by
|
|
@samp{ls -l}---for example, @code{"drwxr-xr-x"}.
|
|
|
|
@item "type"
|
|
A printable string representation of the file's type. The value
|
|
is one of the following:
|
|
|
|
@table @code
|
|
@item "blockdev"
|
|
@itemx "chardev"
|
|
The file is a block or character device (``special file'').
|
|
|
|
@ignore
|
|
@item "door"
|
|
The file is a Solaris ``door'' (special file used for
|
|
interprocess communications).
|
|
@end ignore
|
|
|
|
@item "directory"
|
|
The file is a directory.
|
|
|
|
@item "fifo"
|
|
The file is a named-pipe (also known as a FIFO).
|
|
|
|
@item "file"
|
|
The file is just a regular file.
|
|
|
|
@item "socket"
|
|
The file is an @code{AF_UNIX} (``Unix domain'') socket in the
|
|
filesystem.
|
|
|
|
@item "symlink"
|
|
The file is a symbolic link.
|
|
@end table
|
|
@end table
|
|
|
|
Several additional elements may be present depending upon the operating
|
|
system and the type of the file. You can test for them in your @command{awk}
|
|
program by using the @code{in} operator
|
|
(@pxref{Reference to Elements}):
|
|
|
|
@table @code
|
|
@item "blksize"
|
|
The preferred block size for I/O to the file. This field is not
|
|
present on all POSIX-like systems in the C @code{stat} structure.
|
|
|
|
@item "linkval"
|
|
If the file is a symbolic link, this element is the name of the
|
|
file the link points to (i.e., the value of the link).
|
|
|
|
@item "rdev"
|
|
@itemx "major"
|
|
@itemx "minor"
|
|
If the file is a block or character device file, then these values
|
|
represent the numeric device number and the major and minor components
|
|
of that number, respectively.
|
|
@end table
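
The relationship between the @code{"rdev"}, @code{"major"}, and @code{"minor"} elements can be sketched in C with the @code{major} and @code{minor} macros from @code{<sys/sysmacros.h>}, as found on GNU/Linux. The structure @code{struct devparts} and the function @code{split_device} are hypothetical names used only for this illustration:

```c
#include <sys/types.h>
#include <sys/sysmacros.h>

/* struct devparts --- hypothetical holder for the three device values */

struct devparts {
	unsigned long rdev;	/* the whole device number */
	unsigned int maj;	/* its major component */
	unsigned int min;	/* its minor component */
};

/* split_device --- break a device number into its components */

struct devparts
split_device(dev_t dev)
{
	struct devparts parts;

	parts.rdev = (unsigned long) dev;
	parts.maj = major(dev);
	parts.min = minor(dev);

	return parts;
}
```

The device number for a real file would come from the @code{st_rdev} member of the C @code{stat} structure. The companion macro @code{makedev} combines a major and minor number back into a single value.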
|
|
|
|
@node Internal File Ops
|
|
@appendixsubsubsec C Code for @code{chdir} and @code{stat}
|
|
|
|
Here is the C code for these extensions. They were written for
|
|
GNU/Linux. The code needs some more work for complete portability
|
|
to other POSIX-compliant systems:@footnote{This version is edited
|
|
slightly for presentation. The complete version can be found in
|
|
@file{extension/filefuncs.c} in the @command{gawk} distribution.}
|
|
|
|
@c break line for page breaking
|
|
@example
|
|
#include "awk.h"
|
|
|
|
#include <sys/sysmacros.h>
|
|
|
|
/* do_chdir --- provide dynamically loaded
|
|
chdir() builtin for gawk */
|
|
|
|
static NODE *
|
|
do_chdir(tree)
|
|
NODE *tree;
|
|
@{
|
|
NODE *newdir;
|
|
int ret = -1;
|
|
|
|
newdir = get_argument(tree, 0);
|
|
@end example
|
|
|
|
The file includes the @code{"awk.h"} header file for definitions
|
|
for the @command{gawk} internals. It includes @code{<sys/sysmacros.h>}
|
|
for access to the @code{major} and @code{minor} macros.
|
|
|
|
@cindex programming conventions, @command{gawk} internals
|
|
By convention, for an @command{awk} function @code{foo}, the function that
|
|
implements it is called @samp{do_foo}. The function should take
|
|
a @samp{NODE *} argument, usually called @code{tree}, that
|
|
represents the argument list to the function. The @code{newdir}
|
|
variable represents the new directory to change to, retrieved
|
|
with @code{get_argument}. Note that the first argument is
|
|
numbered zero.
|
|
|
|
This code actually accomplishes the @code{chdir}. It first forces
|
|
the argument to be a string and passes the string value to the
|
|
@code{chdir} system call. If the @code{chdir} fails, @code{ERRNO}
|
|
is updated.
|
|
The result of @code{force_string} has to be freed with @code{free_temp}:
|
|
|
|
@example
|
|
if (newdir != NULL) @{
|
|
(void) force_string(newdir);
|
|
ret = chdir(newdir->stptr);
|
|
if (ret < 0)
|
|
update_ERRNO();
|
|
|
|
free_temp(newdir);
|
|
@}
|
|
@end example
|
|
|
|
Finally, the function returns the return value to the @command{awk} level,
|
|
using @code{set_value}. Then it must return a value from the call to
|
|
the new built-in (this value is ignored by the interpreter):
|
|
|
|
@example
|
|
/* Set the return value */
|
|
set_value(tmp_number((AWKNUM) ret));
|
|
|
|
/* Just to make the interpreter happy */
|
|
return tmp_number((AWKNUM) 0);
|
|
@}
|
|
@end example
|
|
|
|
The @code{stat} built-in is more involved. First comes a function
|
|
that turns a numeric mode into a printable representation
|
|
(e.g., 644 becomes @samp{-rw-r--r--}). This is omitted here for brevity:
|
|
|
|
@c break line for page breaking
|
|
@example
|
|
/* format_mode --- turn a stat mode field
|
|
into something readable */
|
|
|
|
static char *
|
|
format_mode(fmode)
|
|
unsigned long fmode;
|
|
@{
|
|
@dots{}
|
|
@}
|
|
@end example
|
|
|
|
Next comes the actual @code{do_stat} function itself. First come the
|
|
variable declarations and argument checking:
|
|
|
|
@ignore
|
|
Changed message for page breaking. Used to be:
|
|
"stat: called with incorrect number of arguments (%d), should be 2",
|
|
@end ignore
|
|
@example
|
|
/* do_stat --- provide a stat() function for gawk */
|
|
|
|
static NODE *
|
|
do_stat(tree)
|
|
NODE *tree;
|
|
@{
|
|
NODE *file, *array;
|
|
struct stat sbuf;
|
|
int ret;
|
|
char *msg;
|
|
NODE **aptr;
|
|
char *pmode; /* printable mode */
|
|
char *type = "unknown";
|
|
|
|
/* check arg count */
|
|
if (tree->param_cnt != 2)
|
|
fatal(
|
|
"stat: called with %d arguments, should be 2",
|
|
tree->param_cnt);
|
|
@end example
|
|
|
|
Then comes the actual work. First, we get the arguments.
|
|
Then, we always clear the array. To get the file information,
|
|
we use @code{lstat}, in case the file is a symbolic link.
|
|
If there's an error, we set @code{ERRNO} and return:
|
|
|
|
@c comment made multiline for page breaking
|
|
@example
|
|
/*
|
|
* file is first arg,
|
|
* array to hold results is second
|
|
*/
|
|
file = get_argument(tree, 0);
|
|
array = get_argument(tree, 1);
|
|
|
|
/* empty out the array */
|
|
assoc_clear(array);
|
|
|
|
/* lstat the file, if error, set ERRNO and return */
|
|
(void) force_string(file);
|
|
ret = lstat(file->stptr, & sbuf);
|
|
if (ret < 0) @{
|
|
update_ERRNO();
|
|
|
|
set_value(tmp_number((AWKNUM) ret));
|
|
|
|
free_temp(file);
|
|
return tmp_number((AWKNUM) 0);
|
|
@}
|
|
@end example
|
|
|
|
Now comes the tedious part: filling in the array. Only a few of the
|
|
calls are shown here, since they all follow the same pattern:
|
|
|
|
@example
|
|
/* fill in the array */
|
|
aptr = assoc_lookup(array, tmp_string("name", 4), FALSE);
|
|
*aptr = dupnode(file);
|
|
|
|
aptr = assoc_lookup(array, tmp_string("mode", 4), FALSE);
|
|
*aptr = make_number((AWKNUM) sbuf.st_mode);
|
|
|
|
aptr = assoc_lookup(array, tmp_string("pmode", 5), FALSE);
|
|
pmode = format_mode(sbuf.st_mode);
|
|
*aptr = make_string(pmode, strlen(pmode));
|
|
@end example
|
|
|
|
When done, we free the temporary value containing the @value{FN},
|
|
set the return value, and return:
|
|
|
|
@example
|
|
    free_temp(file);

    /* Set the return value */
    set_value(tmp_number((AWKNUM) ret));

    /* Just to make the interpreter happy */
    return tmp_number((AWKNUM) 0);
@}
|
|
@end example
|
|
|
|
@cindex programming conventions, @command{gawk} internals
|
|
Finally, it's necessary to provide the ``glue'' that loads the
|
|
new function(s) into @command{gawk}. By convention, each library has
|
|
a routine named @code{dlload} that does the job:
|
|
|
|
@example
|
|
/* dlload --- load new builtins in this library */

NODE *
dlload(tree, dl)
NODE *tree;
void *dl;
@{
    make_builtin("chdir", do_chdir, 1);
    make_builtin("stat", do_stat, 2);
    return tmp_number((AWKNUM) 0);
@}
|
|
@end example
|
|
|
|
And that's it! As an exercise, consider adding functions to
|
|
implement system calls such as @code{chown}, @code{chmod}, and @code{umask}.
|
|
|
|
@node Using Internal File Ops
|
|
@appendixsubsubsec Integrating the Extensions
|
|
|
|
@c last comma is part of secondary
|
|
@cindex @command{gawk}, interpreter, adding code to
|
|
Now that the code is written, it must be possible to add it at
|
|
runtime to the running @command{gawk} interpreter. First, the
|
|
code must be compiled. Assuming that the functions are in
|
|
a file named @file{filefuncs.c}, and @var{idir} is the location
|
|
of the @command{gawk} include files,
|
|
the following steps create
|
|
a GNU/Linux shared library:
|
|
|
|
@example
|
|
$ gcc -shared -DHAVE_CONFIG_H -c -O -g -I@var{idir} filefuncs.c
|
|
$ ld -o filefuncs.so -shared filefuncs.o
|
|
@end example
|
|
|
|
@cindex @code{extension} function (@command{gawk})
|
|
Once the library exists, it is loaded by calling the @code{extension}
|
|
built-in function.
|
|
This function takes two arguments: the name of the
|
|
library to load and the name of a function to call when the library
|
|
is first loaded. That initialization function adds the new functions to @command{gawk}.
@code{extension} returns the value returned by the initialization function
|
|
within the shared library:
|
|
|
|
@example
|
|
# file testff.awk
BEGIN @{
    extension("./filefuncs.so", "dlload")

    chdir(".")  # no-op

    data[1] = 1 # force `data' to be an array
    print "Info for testff.awk"
    ret = stat("testff.awk", data)
    print "ret =", ret
    for (i in data)
        printf "data[\"%s\"] = %s\n", i, data[i]
    print "testff.awk modified:",
        strftime("%m %d %y %H:%M:%S", data["mtime"])
@}
|
|
@end example
|
|
|
|
Here are the results of running the program:
|
|
|
|
@example
|
|
$ gawk -f testff.awk
|
|
@print{} Info for testff.awk
|
|
@print{} ret = 0
|
|
@print{} data["blksize"] = 4096
|
|
@print{} data["mtime"] = 932361936
|
|
@print{} data["mode"] = 33188
|
|
@print{} data["type"] = file
|
|
@print{} data["dev"] = 2065
|
|
@print{} data["gid"] = 10
|
|
@print{} data["ino"] = 878597
|
|
@print{} data["ctime"] = 971431797
|
|
@print{} data["blocks"] = 2
|
|
@print{} data["nlink"] = 1
|
|
@print{} data["name"] = testff.awk
|
|
@print{} data["atime"] = 971608519
|
|
@print{} data["pmode"] = -rw-r--r--
|
|
@print{} data["size"] = 607
|
|
@print{} data["uid"] = 2076
|
|
@print{} testff.awk modified: 07 19 99 08:25:36
|
|
@end example
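
Because @code{stat} returns a negative value and sets @code{ERRNO} when
@code{lstat} fails, a script can test the return value before it uses
the array. The following is only a sketch: the @value{FN}
@file{/no/such/file} is a made-up example, and the text stored in
@code{ERRNO} depends on the system:

@example
BEGIN @{
    extension("./filefuncs.so", "dlload")

    data[1] = 1    # force `data' to be an array
    if (stat("/no/such/file", data) < 0)
        printf "stat failed: %s\n", ERRNO
    else
        print "size =", data["size"]
@}
@end example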
|
|
@c ENDOFRANGE filre
|
|
@c ENDOFRANGE dirch
|
|
@c ENDOFRANGE statg
|
|
@c ENDOFRANGE chdirg
|
|
@c ENDOFRANGE gladfgaw
|
|
@c ENDOFRANGE adfugaw
|
|
@c ENDOFRANGE fubadgaw
|
|
|
|
@node Future Extensions
|
|
@appendixsec Probable Future Extensions
|
|
@ignore
|
|
From emory!scalpel.netlabs.com!lwall Tue Oct 31 12:43:17 1995
|
|
Return-Path: <emory!scalpel.netlabs.com!lwall>
|
|
Message-Id: <9510311732.AA28472@scalpel.netlabs.com>
|
|
To: arnold@skeeve.atl.ga.us (Arnold D. Robbins)
|
|
Subject: Re: May I quote you?
|
|
In-Reply-To: Your message of "Tue, 31 Oct 95 09:11:00 EST."
|
|
<m0tAHPQ-00014MC@skeeve.atl.ga.us>
|
|
Date: Tue, 31 Oct 95 09:32:46 -0800
|
|
From: Larry Wall <emory!scalpel.netlabs.com!lwall>
|
|
|
|
: Greetings. I am working on the release of gawk 3.0. Part of it will be a
|
|
: thoroughly updated manual. One of the sections deals with planned future
|
|
: extensions and enhancements. I have the following at the beginning
|
|
: of it:
|
|
:
|
|
: @cindex PERL
|
|
: @cindex Wall, Larry
|
|
: @display
|
|
: @i{AWK is a language similar to PERL, only considerably more elegant.} @*
|
|
: Arnold Robbins
|
|
: @sp 1
|
|
: @i{Hey!} @*
|
|
: Larry Wall
|
|
: @end display
|
|
:
|
|
: Before I actually release this for publication, I wanted to get your
|
|
: permission to quote you. (Hopefully, in the spirit of much of GNU, the
|
|
: implied humor is visible... :-)
|
|
|
|
I think that would be fine.
|
|
|
|
Larry
|
|
@end ignore
|
|
@cindex PERL
|
|
@cindex Wall, Larry
|
|
@cindex Robbins, Arnold
|
|
@quotation
|
|
@i{AWK is a language similar to PERL, only considerably more elegant.}@*
|
|
Arnold Robbins
|
|
|
|
@i{Hey!}@*
|
|
Larry Wall
|
|
@end quotation
|
|
|
|
This @value{SECTION} briefly lists extensions and possible improvements
|
|
that indicate the directions we are
|
|
currently considering for @command{gawk}. The file @file{FUTURES} in the
|
|
@command{gawk} distribution lists these extensions as well.
|
|
|
|
Following is a list of probable future changes visible at the
|
|
@command{awk} language level:
|
|
|
|
@c these are ordered by likelihood
|
|
@table @asis
|
|
@item Loadable module interface
|
|
It is not clear that the @command{awk}-level interface to the
|
|
modules facility is as good as it should be. The interface needs to be
|
|
redesigned, particularly taking namespace issues into account, as
|
|
well as possibly including issues such as library search path order
|
|
and versioning.
|
|
|
|
@item @code{RECLEN} variable for fixed-length records
|
|
Along with @code{FIELDWIDTHS}, this would speed up the processing of
|
|
fixed-length records.
|
|
@code{PROCINFO["RS"]} would be @code{"RS"} or @code{"RECLEN"},
|
|
depending upon which kind of record processing is in effect.
|
|
|
|
@item Additional @code{printf} specifiers
|
|
The 1999 ISO C standard added a number of new @code{printf}
|
|
format specifiers. These should be evaluated for possible inclusion
|
|
in @command{gawk}.
|
|
|
|
@ignore
|
|
@item A @samp{%'d} flag
|
|
Add @samp{%'d} for putting in commas in formatting numeric values.
|
|
@end ignore
|
|
|
|
@item Databases
|
|
It may be possible to map a GDBM/NDBM/SDBM file into an @command{awk} array.
|
|
|
|
@item Large character sets
|
|
It would be nice if @command{gawk} could handle UTF-8 and other
|
|
character sets that are larger than eight bits.
|
|
|
|
@item More @code{lint} warnings
|
|
There are more things that could be checked for portability.
|
|
@end table
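
To give the @code{RECLEN} item some context, here is a sketch of what
@code{FIELDWIDTHS} already does in @command{gawk}: it splits each record
into fields of fixed widths instead of splitting at whitespace. The input
string and the widths are made up for illustration:

@example
$ echo abc12345 | gawk 'BEGIN @{ FIELDWIDTHS = "3 2 3" @} @{ print $1, $2, $3 @}'
@print{} abc 12 345
@end example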
|
|
|
|
Following is a list of probable improvements that will make @command{gawk}'s
|
|
source code easier to work with:
|
|
|
|
@table @asis
|
|
@item Loadable module mechanics
|
|
The current extension mechanism works
|
|
(@pxref{Dynamic Extensions}),
|
|
but is rather primitive. It requires a fair amount of manual work
|
|
to create and integrate a loadable module.
|
|
Nor is the current mechanism as portable as might be desired.
|
|
The GNU @command{libtool} package provides a number of features that
|
|
would make using loadable modules much easier.
|
|
@command{gawk} should be changed to use @command{libtool}.
|
|
|
|
@item Loadable module internals
|
|
The API to its internals that @command{gawk} ``exports'' should be revised.
|
|
Too many things are needlessly exposed. A new API should be designed
|
|
and implemented to make module writing easier.
|
|
|
|
@item Better array subscript management
|
|
@command{gawk}'s management of array subscript storage could use revamping,
|
|
so that using the same value to index multiple arrays only
|
|
stores one copy of the index value.
|
|
|
|
@item Integrating the DBUG library
|
|
Integrating Fred Fish's DBUG library would be helpful during development,
|
|
but it's a lot of work to do.
|
|
@end table
|
|
|
|
Following is a list of probable improvements that will make @command{gawk}
|
|
perform better:
|
|
|
|
@table @asis
|
|
@c NEXT ED: remove this item. awka and mawk do these respectively
|
|
@item Compilation of @command{awk} programs
|
|
@command{gawk} uses a Bison (YACC-like)
|
|
parser to convert the script given it into a syntax tree; the syntax
|
|
tree is then executed by a simple recursive evaluator. This method incurs
|
|
a lot of overhead, since the recursive evaluator performs many procedure
|
|
calls to do even the simplest things.
|
|
|
|
It should be possible for @command{gawk} to convert the script's parse tree
|
|
into a C program which the user would then compile, using the normal
|
|
C compiler and a special @command{gawk} library to provide all the needed
|
|
functions (regexps, fields, associative arrays, type coercion, and so on).
|
|
|
|
@c last comma is part of secondary
|
|
@cindex @command{gawk}, interpreter, adding code to
|
|
An easier possibility might be for an intermediate phase of @command{gawk} to
|
|
convert the parse tree into a linear byte code form like the one used
|
|
in GNU Emacs Lisp. The recursive evaluator would then be replaced by
|
|
a straight line byte code interpreter that would be intermediate in speed
|
|
between running a compiled program and doing what @command{gawk} does
|
|
now.
|
|
@end table
|
|
|
|
Finally,
|
|
the programs in the test suite could use documenting in this @value{DOCUMENT}.
|
|
|
|
@xref{Additions},
|
|
if you are interested in tackling any of these projects.
|
|
@c ENDOFRANGE impis
|
|
@c ENDOFRANGE gawii
|
|
|
|
@node Basic Concepts
|
|
@appendix Basic Programming Concepts
|
|
@cindex programming, concepts
|
|
@c STARTOFRANGE procon
|
|
@cindex programming, concepts
|
|
|
|
This @value{APPENDIX} attempts to define some of the basic concepts
|
|
and terms that are used throughout the rest of this @value{DOCUMENT}.
|
|
As this @value{DOCUMENT} is specifically about @command{awk},
|
|
and not about computer programming in general, the coverage here
|
|
is by necessity fairly cursory and simplistic.
|
|
(If you need more background, there are many
|
|
other introductory texts that you should refer to instead.)
|
|
|
|
@menu
|
|
* Basic High Level:: The high level view.
|
|
* Basic Data Typing:: A very quick intro to data types.
|
|
* Floating Point Issues:: Stuff to know about floating-point numbers.
|
|
@end menu
|
|
|
|
@node Basic High Level
|
|
@appendixsec What a Program Does
|
|
|
|
@cindex processing data
|
|
At the most basic level, the job of a program is to process
|
|
some input data and produce results.
|
|
|
|
@c NEXT ED: Use real images here
|
|
@iftex
|
|
@tex
|
|
\expandafter\ifx\csname graph\endcsname\relax \csname newbox\endcsname\graph\fi
|
|
\expandafter\ifx\csname graphtemp\endcsname\relax \csname newdimen\endcsname\graphtemp\fi
|
|
\setbox\graph=\vtop{\vskip 0pt\hbox{%
|
|
\special{pn 20}%
|
|
\special{pa 2425 200}%
|
|
\special{pa 2850 200}%
|
|
\special{fp}%
|
|
\special{sh 1.000}%
|
|
\special{pn 20}%
|
|
\special{pa 2750 175}%
|
|
\special{pa 2850 200}%
|
|
\special{pa 2750 225}%
|
|
\special{pa 2750 175}%
|
|
\special{fp}%
|
|
\special{pn 20}%
|
|
\special{pa 850 200}%
|
|
\special{pa 1250 200}%
|
|
\special{fp}%
|
|
\special{sh 1.000}%
|
|
\special{pn 20}%
|
|
\special{pa 1150 175}%
|
|
\special{pa 1250 200}%
|
|
\special{pa 1150 225}%
|
|
\special{pa 1150 175}%
|
|
\special{fp}%
|
|
\special{pn 20}%
|
|
\special{pa 2950 400}%
|
|
\special{pa 3650 400}%
|
|
\special{pa 3650 0}%
|
|
\special{pa 2950 0}%
|
|
\special{pa 2950 400}%
|
|
\special{fp}%
|
|
\special{pn 10}%
|
|
\special{ar 1800 200 450 200 0 6.28319}%
|
|
\graphtemp=.5ex\advance\graphtemp by 0.200in
|
|
\rlap{\kern 3.300in\lower\graphtemp\hbox to 0pt{\hss Results\hss}}%
|
|
\graphtemp=.5ex\advance\graphtemp by 0.200in
|
|
\rlap{\kern 1.800in\lower\graphtemp\hbox to 0pt{\hss Program\hss}}%
|
|
\special{pn 10}%
|
|
\special{pa 0 400}%
|
|
\special{pa 700 400}%
|
|
\special{pa 700 0}%
|
|
\special{pa 0 0}%
|
|
\special{pa 0 400}%
|
|
\special{fp}%
|
|
\graphtemp=.5ex\advance\graphtemp by 0.200in
|
|
\rlap{\kern 0.350in\lower\graphtemp\hbox to 0pt{\hss Data\hss}}%
|
|
\hbox{\vrule depth0.400in width0pt height 0pt}%
|
|
\kern 3.650in
|
|
}%
|
|
}%
|
|
\centerline{\box\graph}
|
|
@end tex
|
|
@end iftex
|
|
@ifnottex
|
|
@example
|
|
                  _______
+------+         /       \         +---------+
| Data | -----> < Program > -----> | Results |
+------+         \_______/         +---------+
|
|
@end example
|
|
@end ifnottex
|
|
|
|
@cindex compiled programs
|
|
@cindex interpreted programs
|
|
The ``program'' in the figure can be either a compiled
|
|
program@footnote{Compiled programs are typically written
|
|
in lower-level languages such as C, C++, Fortran, or Ada,
|
|
and then translated, or @dfn{compiled}, into a form that
|
|
the computer can execute directly.}
|
|
(such as @command{ls}),
|
|
or it may be @dfn{interpreted}. In the latter case, a machine-executable
|
|
program such as @command{awk} reads your program, and then uses the
|
|
instructions in your program to process the data.
|
|
|
|
@cindex programming, basic steps
|
|
When you write a program, it usually consists
|
|
of the following, very basic set of steps:
|
|
|
|
@c NEXT ED: Use real images here
|
|
@iftex
|
|
@tex
|
|
\expandafter\ifx\csname graph\endcsname\relax \csname newbox\endcsname\graph\fi
|
|
\expandafter\ifx\csname graphtemp\endcsname\relax \csname newdimen\endcsname\graphtemp\fi
|
|
\setbox\graph=\vtop{\vskip 0pt\hbox{%
|
|
\graphtemp=.5ex\advance\graphtemp by 0.600in
|
|
\rlap{\kern 2.800in\lower\graphtemp\hbox to 0pt{\hss Yes\hss}}%
|
|
\graphtemp=.5ex\advance\graphtemp by 0.100in
|
|
\rlap{\kern 3.300in\lower\graphtemp\hbox to 0pt{\hss No\hss}}%
|
|
\special{pn 8}%
|
|
\special{pa 2100 1000}%
|
|
\special{pa 1600 1000}%
|
|
\special{pa 1600 1000}%
|
|
\special{pa 1600 300}%
|
|
\special{fp}%
|
|
\special{sh 1.000}%
|
|
\special{pn 8}%
|
|
\special{pa 1575 400}%
|
|
\special{pa 1600 300}%
|
|
\special{pa 1625 400}%
|
|
\special{pa 1575 400}%
|
|
\special{fp}%
|
|
\special{pn 8}%
|
|
\special{pa 2600 500}%
|
|
\special{pa 2600 900}%
|
|
\special{fp}%
|
|
\special{sh 1.000}%
|
|
\special{pn 8}%
|
|
\special{pa 2625 800}%
|
|
\special{pa 2600 900}%
|
|
\special{pa 2575 800}%
|
|
\special{pa 2625 800}%
|
|
\special{fp}%
|
|
\special{pn 8}%
|
|
\special{pa 3200 200}%
|
|
\special{pa 4000 200}%
|
|
\special{fp}%
|
|
\special{sh 1.000}%
|
|
\special{pn 8}%
|
|
\special{pa 3900 175}%
|
|
\special{pa 4000 200}%
|
|
\special{pa 3900 225}%
|
|
\special{pa 3900 175}%
|
|
\special{fp}%
|
|
\special{pn 8}%
|
|
\special{pa 1400 200}%
|
|
\special{pa 2100 200}%
|
|
\special{fp}%
|
|
\special{sh 1.000}%
|
|
\special{pn 8}%
|
|
\special{pa 2000 175}%
|
|
\special{pa 2100 200}%
|
|
\special{pa 2000 225}%
|
|
\special{pa 2000 175}%
|
|
\special{fp}%
|
|
\special{pn 8}%
|
|
\special{ar 2600 1000 400 100 0 6.28319}%
|
|
\graphtemp=.5ex\advance\graphtemp by 1.000in
|
|
\rlap{\kern 2.600in\lower\graphtemp\hbox to 0pt{\hss Process\hss}}%
|
|
\special{pn 8}%
|
|
\special{pa 2200 400}%
|
|
\special{pa 3100 400}%
|
|
\special{pa 3100 0}%
|
|
\special{pa 2200 0}%
|
|
\special{pa 2200 400}%
|
|
\special{fp}%
|
|
\graphtemp=.5ex\advance\graphtemp by 0.200in
|
|
\rlap{\kern 2.688in\lower\graphtemp\hbox to 0pt{\hss More Data?\hss}}%
|
|
\special{pn 8}%
|
|
\special{ar 650 200 650 200 0 6.28319}%
|
|
\graphtemp=.5ex\advance\graphtemp by 0.200in
|
|
\rlap{\kern 0.613in\lower\graphtemp\hbox to 0pt{\hss Initialization\hss}}%
|
|
\special{pn 8}%
|
|
\special{ar 0 200 0 0 0 6.28319}%
|
|
\special{pn 8}%
|
|
\special{ar 4550 200 450 100 0 6.28319}%
|
|
\graphtemp=.5ex\advance\graphtemp by 0.200in
|
|
\rlap{\kern 4.600in\lower\graphtemp\hbox to 0pt{\hss Clean Up\hss}}%
|
|
\hbox{\vrule depth1.100in width0pt height 0pt}%
|
|
\kern 5.000in
|
|
}%
|
|
}%
|
|
\centerline{\box\graph}
|
|
@end tex
|
|
@end iftex
|
|
@ifnottex
|
|
@example
|
|
                              ______
+----------------+           / More \   No      +----------+
| Initialization | -------> <  Data  > -------> | Clean Up |
+----------------+    ^      \   ?  /           +----------+
                      |       +--+-+
                      |          | Yes
                      |          |
                      |          V
                      |     +---------+
                      +-----+ Process |
                            +---------+
|
|
@end example
|
|
@end ifnottex
|
|
|
|
@table @asis
|
|
@item Initialization
|
|
These are the things you do before actually starting to process
|
|
data, such as checking arguments, initializing any data you need
|
|
to work with, and so on.
|
|
This step corresponds to @command{awk}'s @code{BEGIN} rule
|
|
(@pxref{BEGIN/END}).
|
|
|
|
If you were baking a cake, this might consist of laying out all the
|
|
mixing bowls and the baking pan, and making sure you have all the
|
|
ingredients that you need.
|
|
|
|
@item Processing
|
|
This is where the actual work is done. Your program reads data,
|
|
one logical chunk at a time, and processes it as appropriate.
|
|
|
|
In most programming languages, you have to manually manage the reading
|
|
of data, checking to see if there is more each time you read a chunk.
|
|
@command{awk}'s pattern-action paradigm
|
|
(@pxref{Getting Started})
|
|
handles the mechanics of this for you.
|
|
|
|
In baking a cake, the processing corresponds to the actual labor:
|
|
breaking eggs, mixing the flour, water, and other ingredients, and then putting the cake
|
|
into the oven.
|
|
|
|
@item Clean Up
|
|
Once you've processed all the data, you may have things you need to
|
|
do before exiting.
|
|
This step corresponds to @command{awk}'s @code{END} rule
|
|
(@pxref{BEGIN/END}).
|
|
|
|
After the cake comes out of the oven, you still have to wrap it in
|
|
plastic wrap to keep anyone from tasting it, as well as wash
|
|
the mixing bowls and utensils.
|
|
@end table
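
These three steps map directly onto the parts of an @command{awk} program.
As a minimal sketch (the variable name @code{total} and the assumption that
every input line begins with a number are ours, purely for illustration),
the following program adds up the first field of every record:

@example
BEGIN @{ total = 0 @}              # initialization
      @{ total += $1 @}            # processing, once per record
END   @{ print "total:", total @}  # clean up
@end example

Run on three input lines containing 1, 2, and 3, this prints
@samp{total: 6}.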
|
|
|
|
@cindex algorithms
|
|
An @dfn{algorithm} is a detailed set of instructions necessary to accomplish
|
|
a task, or process data. It is much the same as a recipe for baking
|
|
a cake. Programs implement algorithms. Often, it is up to you to design
|
|
the algorithm and implement it, simultaneously.
|
|
|
|
@cindex records
|
|
@cindex fields
|
|
The ``logical chunks'' we talked about previously are called @dfn{records},
|
|
similar to the records a company keeps on employees, a school keeps for
|
|
students, or a doctor keeps for patients.
|
|
Each record has many component parts, such as first and last names,
|
|
date of birth, address, and so on. The component parts are referred
|
|
to as the @dfn{fields} of the record.
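
To make this concrete, here is a small sketch; the input line is invented
for illustration. By default, @command{awk} treats each line as one record
and splits it into fields at whitespace. @code{NF} holds the number of
fields in the current record:

@example
$ echo "Jane Doe 1970-01-01" | awk '@{ print NF; print $2 ", " $1 @}'
@print{} 3
@print{} Doe, Jane
@end example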
|
|
|
|
The act of reading data is termed @dfn{input}, and that of
|
|
generating results, not too surprisingly, is termed @dfn{output}.
|
|
They are often referred to together as ``input/output,''
|
|
and even more often, as ``I/O'' for short.
|
|
(You will also see ``input'' and ``output'' used as verbs.)
|
|
|
|
@cindex data-driven languages
|
|
@c comma is part of primary
|
|
@cindex languages, data-driven
|
|
@command{awk} manages the reading of data for you, as well as
|
|
breaking it up into records and fields. Your program's job is to
|
|
tell @command{awk} what to do with the data. You do this by describing
|
|
@dfn{patterns} in the data to look for, and @dfn{actions} to execute
|
|
when those patterns are seen. This @dfn{data-driven} nature of
|
|
@command{awk} programs usually makes them both easier to write
|
|
and easier to read.
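
As a brief sketch of the pattern-action style (the sample input is made
up), the rule below has a pattern that selects records whose second field
is larger than 10, and an action that prints the first field of each such
record:

@example
$ printf 'apple 5\npear 12\nplum 30\n' | awk '$2 > 10 @{ print $1 @}'
@print{} pear
@print{} plum
@end example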
|
|
|
|
@node Basic Data Typing
|
|
@appendixsec Data Values in a Computer
|
|
|
|
@cindex variables
|
|
In a program,
|
|
you keep track of information and values in things called @dfn{variables}.
|
|
A variable is just a name for a given value, such as @code{first_name},
|
|
@code{last_name}, @code{address}, and so on.
|
|
@command{awk} has several predefined variables, and it has
|
|
special names to refer to the current input record
|
|
and the fields of the record.
|
|
You may also group multiple
|
|
associated values under one name, as an array.
|
|
|
|
@cindex values, numeric
|
|
@cindex values, string
|
|
@cindex scalar values
|
|
Data, particularly in @command{awk}, consists of either numeric
|
|
values, such as 42 or 3.1415927, or string values.
|
|
String values are essentially anything that's not a number, such as a name.
|
|
Strings are sometimes referred to as @dfn{character data}, since they
|
|
store the individual characters that comprise them.
|
|
Individual values, whether numeric or string, are
|
|
referred to as @dfn{scalar} values.
|
|
Groups of values, such as arrays, are not scalars.
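
The following sketch shows scalar variables holding a string and a number,
and an array that groups related values under one name. All of the names
and values are invented for illustration:

@example
BEGIN @{
    first_name = "Jane"            # scalar string value
    age = 42                       # scalar numeric value
    phone["home"] = "555-0100"     # array: several values, one name
    phone["work"] = "555-0199"
    print first_name, age, phone["work"]
@}
@end example

This prints @samp{Jane 42 555-0199}.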
|
|
|
|
@cindex integers
|
|
@cindex floating-point, numbers
|
|
@cindex numbers, floating-point
|
|
Within computers, there are two kinds of numeric values: @dfn{integers}
|
|
and @dfn{floating-point}.
|
|
In school, integer values were referred to as ``whole'' numbers---that is,
|
|
numbers without any fractional part, such as 1, 42, or @minus{}17.
|
|
The advantage to integer numbers is that they represent values exactly.
|
|
The disadvantage is that their range is limited. On most modern systems,
|
|
this range is @minus{}2,147,483,648 to 2,147,483,647 (the limits of a 32-bit signed integer).
|
|
|
|
@cindex unsigned integers
|
|
@cindex integers, unsigned
|
|
Integer values come in two flavors: @dfn{signed} and @dfn{unsigned}.
|
|
Signed values may be negative or positive, with the range of values just
|
|
described.
|
|
Unsigned values are never negative. On most modern systems,
|
|
the range is from 0 to 4,294,967,295.
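
These limits follow from powers of two: a 32-bit signed integer runs from
@minus{}2^31 to 2^31 @minus{} 1, and a 32-bit unsigned integer from 0 to
2^32 @minus{} 1. The following sketch computes them in @command{awk}.
It uses @samp{%.0f} because @command{awk} itself stores numbers as
floating point, which happens to represent these particular values exactly:

@example
$ awk 'BEGIN @{ printf "%.0f %.0f %.0f\n", -(2^31), 2^31 - 1, 2^32 - 1 @}'
@print{} -2147483648 2147483647 4294967295
@end example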
|
|
|
|
@cindex double-precision floating-point
|
|
@cindex single-precision floating-point
|
|
Floating-point numbers represent what are called ``real'' numbers; i.e.,
|
|
those that do have a fractional part, such as 3.1415927.
|
|
The advantage to floating-point numbers is that they
|
|
can represent a much larger range of values.
|
|
The disadvantage is that there are numbers that they cannot represent
|
|
exactly.
|
|
@command{awk} uses @dfn{double-precision} floating-point numbers, which
|
|
can hold more digits than @dfn{single-precision}
|
|
floating-point numbers.
|
|
Floating-point issues are discussed more fully in
|
|
@ref{Floating Point Issues}.
|
|
|
|
At the very lowest level, computers store values as groups of binary digits,
|
|
or @dfn{bits}. Modern computers group bits into groups of eight, called @dfn{bytes}.
|
|
Advanced applications sometimes have to manipulate bits directly,
|
|
and @command{gawk} provides functions for doing so.
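
As a small sketch of those functions, the following applies
@command{gawk}'s @code{and}, @code{or}, @code{xor}, and @code{lshift}
built-ins to small values (12 is binary 1100 and 10 is binary 1010):

@example
$ gawk 'BEGIN @{ print and(12, 10), or(12, 10), xor(12, 10), lshift(1, 4) @}'
@print{} 8 14 6 16
@end example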
|
|
|
|
@cindex null strings
|
|
While you are probably used to the idea of a number without a value (i.e., zero),
|
|
it takes a bit more getting used to the idea of zero-length character data.
|
|
Nevertheless, such a thing exists.
|
|
It is called the @dfn{null string}.
|
|
The null string is character data that has no value.
|
|
In other words, it is empty. It is written in @command{awk} programs
|
|
like this: @code{""}.
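
A short sketch shows that the null string has length zero, and that a
variable that has never been assigned compares equal to it. The variable
names @code{s} and @code{unset} are arbitrary:

@example
$ awk 'BEGIN @{ s = ""; print length(s), (s == ""), (unset == "") @}'
@print{} 0 1 1
@end example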
|
|
|
|
Humans are used to working in decimal; i.e., base 10. In base 10,
|
|
digits go from 0 to 9, and then ``roll over'' into the next
|
|
column. (Remember grade school? 42 is 4 times 10 plus 2.)
|
|
|
|
There are other number bases though. Computers commonly use base 2
|
|
or @dfn{binary}, base 8 or @dfn{octal}, and base 16 or @dfn{hexadecimal}.
|
|
In binary, each column represents two times the value in the column to
|
|
its right. Each column may contain either a 0 or a 1.
|
|
Thus, binary 1010 represents 1 times 8, plus 0 times 4, plus 1 times 2,
|
|
plus 0 times 1, or decimal 10.
|
|
Octal and hexadecimal are discussed more in
|
|
@ref{Nondecimal-numbers}.
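
The column-by-column arithmetic for binary 1010 can be sketched as a loop.
This is only an illustration, with variable names of our own choosing:
working left to right, each step doubles the running total and adds the
next binary digit:

@example
BEGIN @{
    bits = "1010"
    value = 0
    # each step doubles the running total and adds the next digit
    for (i = 1; i <= length(bits); i++)
        value = value * 2 + substr(bits, i, 1)
    print bits, "in binary is", value, "in decimal"
@}
@end example

This prints @samp{1010 in binary is 10 in decimal}.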
|
|
|
|
Programs are written in programming languages.
|
|
Hundreds, if not thousands, of programming languages exist.
|
|
One of the most popular is the C programming language.
|
|
The C language had a very strong influence on the design of
|
|
the @command{awk} language.
|
|
|
|
@cindex Kernighan, Brian
|
|
@cindex Ritchie, Dennis
|
|
There have been several versions of C. The first is often referred to
|
|
as ``K&R'' C, after the initials of Brian Kernighan and Dennis Ritchie,
|
|
the authors of the first book on C. (Dennis Ritchie created the language,
|
|
and Brian Kernighan was one of the creators of @command{awk}.)
|
|
|
|
In the mid-1980s, an effort began to produce an international standard
|
|
for C. This work culminated in 1989, with the production of the ANSI
|
|
standard for C. This standard became an ISO standard in 1990.
|
|
Where it makes sense, POSIX @command{awk} is compatible with 1990 ISO C.
|
|
|
|
In 1999, a revised ISO C standard was approved and released.
|
|
Future versions of @command{gawk} will be as compatible as possible
|
|
with this standard.
|
|
|
|
@node Floating Point Issues
|
|
@appendixsec Floating-Point Number Caveats
|
|
|
|
As mentioned earlier, floating-point numbers represent what are called
|
|
``real'' numbers, i.e., those that have a fractional part. @command{awk}
|
|
uses double-precision floating-point numbers to represent all
|
|
numeric values. This @value{SECTION} describes some of the issues
|
|
involved in using floating-point numbers.
|
|
|
|
There is a very nice paper on floating-point arithmetic by
|
|
David Goldberg, ``What Every
|
|
Computer Scientist Should Know About Floating-point Arithmetic,''
|
|
@cite{ACM Computing Surveys} @strong{23}, 1 (1991-03),
|
|
5-48.@footnote{@uref{http://www.validlab.com/goldberg/paper.ps}.}
|
|
This is worth reading if you are interested in the details,
|
|
but it does require a background in computer science.
|
|
|
|
Internally, @command{awk} keeps both the numeric value
|
|
(double-precision floating-point) and the string value for a variable.
|
|
Separately, @command{awk} keeps
|
|
track of what type the variable has
|
|
(@pxref{Typing and Comparison}),
|
|
which plays a role in how variables are used in comparisons.
|
|
|
|
It is important to note that the string value for a number may not
|
|
reflect the full value (all the digits) that the numeric value
|
|
actually contains.
|
|
The following program (@file{values.awk}) illustrates this:
|
|
|
|
@example
|
|
@{
    $1 = $2 + $3
    # see it for what it is
    printf("$1 = %.12g\n", $1)
    # use CONVFMT
    a = "<" $1 ">"
    print "a =", a
@group
    # use OFMT
    print "$1 =", $1
@end group
@}
|
|
@end example
|
|
|
|
@noindent
|
|
This program shows the full value of the sum of @code{$2} and @code{$3}
|
|
using @code{printf}, and then prints the string values obtained
|
|
from both automatic conversion (via @code{CONVFMT}) and
|
|
from printing (via @code{OFMT}).
|
|
|
|
Here is what happens when the program is run:
|
|
|
|
@example
|
|
$ echo 2 3.654321 1.2345678 | awk -f values.awk
|
|
@print{} $1 = 4.8888888
|
|
@print{} a = <4.88889>
|
|
@print{} $1 = 4.88889
|
|
@end example
|
|
|
|
This makes it clear that the full numeric value is different from
|
|
what the default string representations show.
|
|
|
|
@code{CONVFMT}'s default value is @code{"%.6g"}, which yields a value with
|
|
at most six significant digits. For some applications, you might want to
|
|
change it to specify more precision.
|
|
On most modern machines, most of the time,
|
|
17 digits is enough to capture a floating-point number's
|
|
value exactly.@footnote{Pathological cases can require up to
|
|
752 digits (!), but we doubt that you need to worry about this.}
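
For instance, the following sketch repeats the sum from @file{values.awk}
with @code{CONVFMT} raised to @code{"%.12g"}, the same format that the
@code{printf} call in that program used. The automatic string conversion
then keeps all the digits of this particular value:

@example
$ echo 2 3.654321 1.2345678 |
> awk 'BEGIN @{ CONVFMT = "%.12g" @} @{ a = "<" ($2 + $3) ">"; print "a =", a @}'
@print{} a = <4.8888888>
@end example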
|
|
|
|
@cindex floating-point
|
|
Unlike numbers in the abstract sense (such as what you studied in high school
|
|
or college math), numbers stored in computers are limited in certain ways.
|
|
They cannot represent an infinite number of digits, nor can they always
|
|
represent things exactly.
|
|
In particular, many decimal fractions have no exact
floating-point representation. Here is an example:
|
|
|
|
@example
|
|
$ awk '@{ printf("%010d\n", $1 * 100) @}'
|
|
515.79
|
|
@print{} 0000051579
|
|
515.80
|
|
@print{} 0000051579
|
|
515.81
|
|
@print{} 0000051580
|
|
515.82
|
|
@print{} 0000051582
|
|
@kbd{@value{CTL}-d}
|
|
@end example
|
|
|
|
@noindent
|
|
This shows that some values can be represented exactly,
|
|
whereas others are only approximated. This is not a ``bug''
|
|
in @command{awk}, but simply an artifact of how computers
|
|
represent numbers.
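
One common way to cope, sketched here rather than prescribed, is to round
explicitly before converting to an integer, instead of letting @samp{%d}
truncate. Adding 0.5 and applying @code{int} rounds a positive value to
the nearest whole number:

@example
$ echo 515.80 | awk '@{ printf("%010d\n", int($1 * 100 + 0.5)) @}'
@print{} 0000051580
@end example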
|
|
|
|
@cindex negative zero
|
|
@cindex positive zero
|
|
@c comma is part of primary
|
|
@cindex zero, negative vs.@: positive
|
|
Another peculiarity of floating-point numbers on modern systems
|
|
is that they often have more than one representation for the number zero!
|
|
In particular, it is possible to represent ``minus zero'' as well as
|
|
regular, or ``positive'' zero.
|
|
|
|
This example shows that negative and positive zero are distinct values
|
|
when stored internally, but that they are in fact equal to each other,
|
|
as well as to ``regular'' zero:
|
|
|
|
@smallexample
|
|
$ gawk 'BEGIN @{ mz = -0 ; pz = 0
|
|
> printf "-0 = %g, +0 = %g, (-0 == +0) -> %d\n", mz, pz, mz == pz
|
|
> printf "mz == 0 -> %d, pz == 0 -> %d\n", mz == 0, pz == 0
|
|
> @}'
|
|
@print{} -0 = -0, +0 = 0, (-0 == +0) -> 1
|
|
@print{} mz == 0 -> 1, pz == 0 -> 1
|
|
@end smallexample
|
|
|
|
It helps to keep this in mind should you process numeric data
|
|
that contains negative zero values; the fact that the zero is negative
|
|
is noted and can affect comparisons.
|
|
@c ENDOFRANGE procon
|
|
|
|
@node Glossary
|
|
@unnumbered Glossary
|
|
|
|
@table @asis
|
|
@item Action
|
|
A series of @command{awk} statements attached to a rule. If the rule's
|
|
pattern matches an input record, @command{awk} executes the
|
|
rule's action. Actions are always enclosed in curly braces.
|
|
(@xref{Action Overview}.)
|
|
|
|
@cindex Spencer, Henry
|
|
@cindex @command{sed} utility
|
|
@cindex amazing @command{awk} assembler (@command{aaa})
|
|
@item Amazing @command{awk} Assembler
|
|
Henry Spencer at the University of Toronto wrote a retargetable assembler
|
|
completely as @command{sed} and @command{awk} scripts. It is thousands
|
|
of lines long, including machine descriptions for several eight-bit
|
|
microcomputers. It is a good example of a program that would have been
|
|
better written in another language.
|
|
You can get it from @uref{ftp://ftp.freefriends.org/arnold/Awkstuff/aaa.tgz}.
|
|
|
|
@cindex amazingly workable formatter (@command{awf})
|
|
@cindex @command{awf} (amazingly workable formatter) program
|
|
@item Amazingly Workable Formatter (@command{awf})
|
|
Henry Spencer at the University of Toronto wrote a formatter that accepts
|
|
a large subset of the @samp{nroff -ms} and @samp{nroff -man} formatting
|
|
commands, using @command{awk} and @command{sh}.
|
|
It is available over the Internet
|
|
from @uref{ftp://ftp.freefriends.org/arnold/Awkstuff/awf.tgz}.
|
|
|
|
@item Anchor
|
|
The regexp metacharacters @samp{^} and @samp{$}, which force the match
|
|
to the beginning or end of the string, respectively.
|
|
|
|
@cindex ANSI
|
|
@item ANSI
|
|
The American National Standards Institute. This organization produces
|
|
many standards, among them the standards for the C and C++ programming
|
|
languages.
|
|
These standards often become international standards as well. See also
|
|
``ISO.''
|
|
|
|
@item Array
|
|
A grouping of multiple values under the same name.
|
|
Most languages just provide sequential arrays.
|
|
@command{awk} provides associative arrays.
|
|
|
|
@item Assertion
|
|
A statement in a program that a condition is true at this point in the program.
|
|
Useful for reasoning about how a program is supposed to behave.
|
|
|
|
@item Assignment
|
|
An @command{awk} expression that changes the value of some @command{awk}
|
|
variable or data object. An object that you can assign to is called an
|
|
@dfn{lvalue}. The assigned values are called @dfn{rvalues}.
|
|
@xref{Assignment Ops}.
|
|
|
|
@item Associative Array
|
|
Arrays in which the indices may be numbers or strings, not just
|
|
sequential integers in a fixed range.
|
|
|
|
@item @command{awk} Language
|
|
The language in which @command{awk} programs are written.
|
|
|
|
@item @command{awk} Program
|
|
An @command{awk} program consists of a series of @dfn{patterns} and
|
|
@dfn{actions}, collectively known as @dfn{rules}. For each input record
|
|
given to the program, the program's rules are all processed in turn.
|
|
@command{awk} programs may also contain function definitions.
|
|
|
|
@item @command{awk} Script
|
|
Another name for an @command{awk} program.
|
|
|
|
@item Bash
|
|
The GNU version of the standard shell
|
|
@ifnotinfo
|
|
(the @b{B}ourne-@b{A}gain @b{SH}ell).
|
|
@end ifnotinfo
|
|
@ifinfo
|
|
(the Bourne-Again SHell).
|
|
@end ifinfo
|
|
See also ``Bourne Shell.''
|
|
|
|
@item BBS
|
|
See ``Bulletin Board System.''
|
|
|
|
@item Bit
|
|
Short for ``Binary Digit.''
|
|
All values in computer memory ultimately reduce to binary digits: values
|
|
that are either zero or one.
|
|
Groups of bits may be interpreted differently---as integers,
|
|
floating-point numbers, character data, addresses of other
|
|
memory objects, or other data.
|
|
@command{awk} lets you work with floating-point numbers and strings.
|
|
@command{gawk} lets you manipulate bit values with the built-in
|
|
functions described in
|
|
@ref{Bitwise Functions}.
|
|
|
|
Computers are often defined by how many bits they use to represent integer
|
|
values. Typical systems are 32-bit systems, but 64-bit systems are
|
|
becoming increasingly popular, and 16-bit systems are waning in
|
|
popularity.
|
|
|
|
@item Boolean Expression
|
|
Named after the English mathematician Boole. See also ``Logical Expression.''
|
|
|
|
@item Bourne Shell
|
|
The standard shell (@file{/bin/sh}) on Unix and Unix-like systems,
|
|
originally written by Stephen R.@: Bourne.
|
|
Many shells (@command{bash}, @command{ksh}, @command{pdksh}, @command{zsh}) are
|
|
generally upwardly compatible with the Bourne shell.
|
|
|
|
@item Built-in Function
|
|
The @command{awk} language provides built-in functions that perform various
|
|
numerical, I/O-related, and string computations. Examples are
|
|
@code{sqrt} (for the square root of a number) and @code{substr} (for a
|
|
substring of a string).
|
|
@command{gawk} provides functions for timestamp management, bit manipulation,
|
|
and runtime string translation.
|
|
(@xref{Built-in}.)
|
|
|
|
@item Built-in Variable
|
|
@code{ARGC},
|
|
@code{ARGV},
|
|
@code{CONVFMT},
|
|
@code{ENVIRON},
|
|
@code{FILENAME},
|
|
@code{FNR},
|
|
@code{FS},
|
|
@code{NF},
|
|
@code{NR},
|
|
@code{OFMT},
|
|
@code{OFS},
|
|
@code{ORS},
|
|
@code{RLENGTH},
|
|
@code{RSTART},
|
|
@code{RS},
|
|
and
|
|
@code{SUBSEP}
|
|
are the variables that have special meaning to @command{awk}.
|
|
In addition,
|
|
@code{ARGIND},
|
|
@code{BINMODE},
|
|
@code{ERRNO},
|
|
@code{FIELDWIDTHS},
|
|
@code{IGNORECASE},
|
|
@code{LINT},
|
|
@code{PROCINFO},
|
|
@code{RT},
|
|
and
|
|
@code{TEXTDOMAIN}
|
|
are the variables that have special meaning to @command{gawk}.
|
|
Changing some of them affects @command{awk}'s running environment.
|
|
(@xref{Built-in Variables}.)
|
|
|
|
@item Braces
|
|
See ``Curly Braces.''
|
|
|
|
@item Bulletin Board System
|
|
A computer system allowing users to log in and read and/or leave messages
|
|
for other users of the system, much like leaving paper notes on a bulletin
|
|
board.
|
|
|
|
@item C
|
|
The system programming language that most GNU software is written in. The
|
|
@command{awk} programming language has C-like syntax, and this @value{DOCUMENT}
|
|
points out similarities between @command{awk} and C when appropriate.
|
|
|
|
In general, @command{gawk} attempts to be as similar to the 1990 version
|
|
of ISO C as makes sense. Future versions of @command{gawk} may adopt features
|
|
from the newer 1999 standard, as appropriate.
|
|
|
|
@item C++
|
|
A popular object-oriented programming language derived from C.
|
|
|
|
@cindex ISO 8859-1
|
|
@cindex ISO Latin-1
|
|
@cindex character sets (machine character encodings)
|
|
@item Character Set
|
|
The set of numeric codes used by a computer system to represent the
|
|
characters (letters, numbers, punctuation, etc.) of a particular country
|
|
or place. The most common character set in use today is ASCII (American
|
|
Standard Code for Information Interchange). Many European
|
|
countries use an extension of ASCII known as ISO-8859-1 (ISO Latin-1).
|
|
|
|
@cindex @command{chem} utility
|
|
@item CHEM
|
|
A preprocessor for @command{pic} that reads descriptions of molecules
|
|
and produces @command{pic} input for drawing them.
|
|
It was written in @command{awk}
|
|
by Brian Kernighan and Jon Bentley, and is available from
|
|
@uref{http://cm.bell-labs.com/netlib/typesetting/chem.gz}.
|
|
|
|
@item Coprocess
|
|
A subordinate program with which two-way communication is possible.
|
|
|
|
@cindex compiled programs
|
|
@item Compiler
|
|
A program that translates human-readable source code into
|
|
machine-executable object code. The object code is then executed
|
|
directly by the computer.
|
|
See also ``Interpreter.''
|
|
|
|
@item Compound Statement
|
|
A series of @command{awk} statements, enclosed in curly braces. Compound
|
|
statements may be nested.
|
|
(@xref{Statements}.)
|
|
|
|
@item Concatenation
|
|
Concatenating two strings means sticking them together, one after another,
|
|
producing a new string. For example, the string @samp{foo} concatenated with
|
|
the string @samp{bar} gives the string @samp{foobar}.
|
|
(@xref{Concatenation}.)
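
In @command{awk}, concatenation has no explicit operator; writing two
expressions next to each other is enough. A small sketch, with
illustrative variable names @code{x} and @code{y}:

@example
$ awk 'BEGIN @{ x = "foo"; y = "bar"; print x y @}'
@print{} foobar
@end example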
|
|
|
|
@item Conditional Expression
|
|
An expression using the @samp{?:} ternary operator, such as
|
|
@samp{@var{expr1} ? @var{expr2} : @var{expr3}}. The expression
|
|
@var{expr1} is evaluated; if the result is true, the value of the whole
|
|
expression is the value of @var{expr2}; otherwise the value is
|
|
@var{expr3}. In either case, only one of @var{expr2} and @var{expr3}
|
|
is evaluated. (@xref{Conditional Exp}.)
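
For example, the following sketch (the variable @code{n} and the
strings are illustrative) picks one of two strings depending on
whether @code{n} is even:

@example
$ awk 'BEGIN @{ n = 3; print (n % 2 == 0 ? "even" : "odd") @}'
@print{} odd
@end example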
|
|
|
|
@item Comparison Expression
|
|
A relation that is either true or false, such as @samp{(a < b)}.
|
|
Comparison expressions are used in @code{if}, @code{while}, @code{do},
|
|
and @code{for}
|
|
statements, and in patterns to select which input records to process.
|
|
(@xref{Typing and Comparison}.)
|
|
|
|
@item Curly Braces
|
|
The characters @samp{@{} and @samp{@}}. Curly braces are used in
|
|
@command{awk} for delimiting actions, compound statements, and function
|
|
bodies.
|
|
|
|
@cindex dark corner
|
|
@item Dark Corner
|
|
An area in the language where specifications often were (or still
|
|
are) not clear, leading to unexpected or undesirable behavior.
|
|
Such areas are marked in this @value{DOCUMENT} with
|
|
@iftex
|
|
the picture of a flashlight in the margin
|
|
@end iftex
|
|
@ifnottex
|
|
``(d.c.)'' in the text
|
|
@end ifnottex
|
|
and are indexed under the heading ``dark corner.''
|
|
|
|
@item Data Driven
|
|
A description of @command{awk} programs, where you specify the data you
|
|
are interested in processing, and what to do when that data is seen.
|
|
|
|
@item Data Objects
|
|
These are numbers and strings of characters. Numbers are converted into
|
|
strings and vice versa, as needed.
|
|
(@xref{Conversion}.)
|
|
|
|
@item Deadlock
|
|
The situation in which two communicating processes are each waiting
|
|
for the other to perform an action.
|
|
|
|
@item Double-Precision
|
|
An internal representation of numbers that can have fractional parts.
|
|
Double-precision numbers keep track of more digits than do single-precision
|
|
numbers, but operations on them are sometimes more expensive. This is the way
|
|
@command{awk} stores numeric values. It is the C type @code{double}.
|
|
|
|
@item Dynamic Regular Expression
|
|
A dynamic regular expression is a regular expression written as an
|
|
ordinary expression. It could be a string constant, such as
|
|
@code{"foo"}, but it may also be an expression whose value can vary.
|
|
(@xref{Computed Regexps}.)
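
A short sketch, assuming an illustrative variable named @code{re}
that holds the regexp as a string:

@example
$ echo 'foobar' | awk '@{ re = "^foo"; if ($0 ~ re) print "match" @}'
@print{} match
@end example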
|
|
|
|
@item Environment
|
|
A collection of strings, of the form @var{name@code{=}val}, that each
|
|
program has available to it. Users generally place values into the
|
|
environment in order to provide information to various programs. Typical
|
|
examples are the environment variables @env{HOME} and @env{PATH}.
|
|
|
|
@item Empty String
|
|
See ``Null String.''
|
|
|
|
@cindex epoch, definition of
|
|
@item Epoch
|
|
The date used as the ``beginning of time'' for timestamps.
|
|
Time values in Unix systems are represented as seconds since the epoch,
|
|
with library functions available for converting these values into
|
|
standard date and time formats.
|
|
|
|
The epoch on Unix and POSIX systems is 1970-01-01 00:00:00 UTC.
|
|
See also ``GMT'' and ``UTC.''
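
As an illustration, and assuming that @command{gawk} is installed and
that the system honors the @env{TZ} environment variable, formatting
timestamp zero in UTC shows the epoch itself:

@example
$ TZ=UTC gawk 'BEGIN @{ print strftime("%Y-%m-%d %H:%M:%S", 0) @}'
@print{} 1970-01-01 00:00:00
@end example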
|
|
|
|
@item Escape Sequences
|
|
A special sequence of characters used for describing nonprinting
|
|
characters, such as @samp{\n} for newline or @samp{\033} for the ASCII
|
|
ESC (Escape) character. (@xref{Escape Sequences}.)
|
|
|
|
@item FDL
|
|
See ``Free Documentation License.''
|
|
|
|
@item Field
|
|
When @command{awk} reads an input record, it splits the record into pieces
|
|
separated by whitespace (or by a separator regexp that you can
|
|
change by setting the built-in variable @code{FS}). Such pieces are
|
|
called fields. If the pieces are of fixed length, you can use the built-in
|
|
variable @code{FIELDWIDTHS} to describe their lengths.
|
|
(@xref{Field Separators},
|
|
and
|
|
@ref{Constant Size}.)
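
As a small sketch (the colon-separated input line is made up for
illustration), setting @code{FS} changes how the record is split:

@example
$ echo 'root:x:0:0' | awk 'BEGIN @{ FS = ":" @} @{ print $1, NF @}'
@print{} root 4
@end example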
|
|
|
|
@item Flag
|
|
A variable whose truth value indicates the existence or nonexistence
|
|
of some condition.
|
|
|
|
@item Floating-Point Number
|
|
Often referred to in mathematical terms as a ``rational'' or real number,
|
|
this is just a number that can have a fractional part.
|
|
See also ``Double-Precision'' and ``Single-Precision.''
|
|
|
|
@item Format
|
|
Format strings are used to control the appearance of output in the
|
|
@code{strftime} and @code{sprintf} functions, and are used in the
|
|
@code{printf} statement as well. Also, data conversions from numbers to strings
|
|
are controlled by the format string contained in the built-in variable
|
|
@code{CONVFMT}. (@xref{Control Letters}.)
|
|
|
|
@item Free Documentation License
|
|
This document describes the terms under which this @value{DOCUMENT}
|
|
is published and may be copied. (@xref{GNU Free Documentation License}.)
|
|
|
|
@item Function
|
|
A specialized group of statements used to encapsulate general
|
|
or program-specific tasks. @command{awk} has a number of built-in
|
|
functions, and also allows you to define your own.
|
|
(@xref{Functions}.)
|
|
|
|
@item FSF
|
|
See ``Free Software Foundation.''
|
|
|
|
@cindex FSF (Free Software Foundation)
|
|
@cindex Free Software Foundation (FSF)
|
|
@cindex Stallman, Richard
|
|
@item Free Software Foundation
|
|
A nonprofit organization dedicated
|
|
to the production and distribution of freely distributable software.
|
|
It was founded by Richard M.@: Stallman, the author of the original
|
|
Emacs editor. GNU Emacs is the most widely used version of Emacs today.
|
|
|
|
@item @command{gawk}
|
|
The GNU implementation of @command{awk}.
|
|
|
|
@cindex GPL (General Public License)
|
|
@cindex General Public License (GPL)
|
|
@cindex GNU General Public License
|
|
@item General Public License
|
|
This document describes the terms under which @command{gawk} and its source
|
|
code may be distributed. (@xref{Copying}.)
|
|
|
|
@item GMT
|
|
``Greenwich Mean Time.''
|
|
This is the old term for UTC.
|
|
It is the time standard in which the epoch for Unix and POSIX systems is expressed.
|
|
See also ``Epoch'' and ``UTC.''
|
|
|
|
@cindex FSF (Free Software Foundation)
|
|
@cindex Free Software Foundation (FSF)
|
|
@cindex GNU Project
|
|
@item GNU
|
|
``GNU's not Unix.'' An ongoing project of the Free Software Foundation
|
|
to create a complete, freely distributable, POSIX-compliant computing
|
|
environment.
|
|
|
|
@item GNU/Linux
|
|
A variant of the GNU system using the Linux kernel, instead of the
|
|
Free Software Foundation's Hurd kernel.
|
|
Linux is a stable, efficient, full-featured clone of Unix that has
|
|
been ported to a variety of architectures.
|
|
It is most popular on PC-class systems, but runs well on a variety of
|
|
other systems too.
|
|
The Linux kernel source code is available under the terms of the GNU General
|
|
Public License, which is perhaps its most important aspect.
|
|
|
|
@item GPL
|
|
See ``General Public License.''
|
|
|
|
@item Hexadecimal
|
|
Base 16 notation, where the digits are @code{0}--@code{9} and
|
|
@code{A}--@code{F}, with @samp{A}
|
|
representing 10, @samp{B} representing 11, and so on, up to @samp{F} for 15.
|
|
Hexadecimal numbers are written in C using a leading @samp{0x},
|
|
to indicate their base. Thus, @code{0x12} is 18 (1 times 16 plus 2).
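
The arithmetic can be checked with the @samp{%x} format of
@code{printf}, which prints a number in hexadecimal. This is only a
sketch of the conversion in both directions:

@example
$ awk 'BEGIN @{ printf "%d %x\n", 1 * 16 + 2, 18 @}'
@print{} 18 12
@end example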
|
|
|
|
@item I/O
|
|
Abbreviation for ``Input/Output,'' the act of moving data into and/or
|
|
out of a running program.
|
|
|
|
@item Input Record
|
|
A single chunk of data that is read in by @command{awk}. Usually, an @command{awk} input
|
|
record consists of one line of text.
|
|
(@xref{Records}.)
|
|
|
|
@item Integer
|
|
A whole number, i.e., a number that does not have a fractional part.
|
|
|
|
@item Internationalization
|
|
The process of writing or modifying a program so
|
|
that it can use multiple languages without requiring
|
|
further source code changes.
|
|
|
|
@cindex interpreted programs
|
|
@item Interpreter
|
|
A program that reads human-readable source code directly, and uses
|
|
the instructions in it to process data and produce results.
|
|
@command{awk} is typically (but not always) implemented as an interpreter.
|
|
See also ``Compiler.''
|
|
|
|
@item Interval Expression
|
|
A component of a regular expression that lets you specify repeated matches of
|
|
some part of the regexp. Interval expressions were not traditionally available
|
|
in @command{awk} programs.
|
|
|
|
@cindex ISO
|
|
@item ISO
|
|
The International Organization for Standardization.
|
|
This organization produces international standards for many things, including
|
|
programming languages, such as C and C++.
|
|
In the computer arena, important standards like those for C, C++, and POSIX
|
|
become both American national and ISO international standards simultaneously.
|
|
This @value{DOCUMENT} refers to Standard C as ``ISO C'' throughout.
|
|
|
|
@item Keyword
|
|
In the @command{awk} language, a keyword is a word that has special
|
|
meaning. Keywords are reserved and may not be used as variable names.
|
|
|
|
@command{gawk}'s keywords are:
|
|
@code{BEGIN},
|
|
@code{END},
|
|
@code{if},
|
|
@code{else},
|
|
@code{while},
|
|
@code{do@dots{}while},
|
|
@code{for},
|
|
@code{for@dots{}in},
|
|
@code{break},
|
|
@code{continue},
|
|
@code{delete},
|
|
@code{next},
|
|
@code{nextfile},
|
|
@code{function},
|
|
@code{func},
|
|
and
|
|
@code{exit}.
|
|
|
|
@cindex LGPL (Lesser General Public License)
|
|
@cindex Lesser General Public License (LGPL)
|
|
@cindex GNU Lesser General Public License
|
|
@item Lesser General Public License
|
|
This document describes the terms under which binary library archives
|
|
or shared objects,
|
|
and their source code may be distributed.
|
|
|
|
@item Linux
|
|
See ``GNU/Linux.''
|
|
|
|
@item LGPL
|
|
See ``Lesser General Public License.''
|
|
|
|
@item Localization
|
|
The process of providing the data necessary for an
|
|
internationalized program to work in a particular language.
|
|
|
|
@item Logical Expression
|
|
An expression using the operators for logic, AND, OR, and NOT, written
|
|
@samp{&&}, @samp{||}, and @samp{!} in @command{awk}. Often called Boolean
|
|
expressions, after the mathematician who pioneered this kind of
|
|
mathematical logic.
|
|
|
|
@item Lvalue
|
|
An expression that can appear on the left side of an assignment
|
|
operator. In most languages, lvalues can be variables or array
|
|
elements. In @command{awk}, a field designator can also be used as an
|
|
lvalue.
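
For instance, in the following sketch (with made-up input) the field
designator @code{$2} appears on the left side of an assignment:

@example
$ echo 'a b c' | awk '@{ $2 = "X"; print @}'
@print{} a X c
@end example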
|
|
|
|
@item Matching
|
|
The act of testing a string against a regular expression. If the
|
|
regexp describes the contents of the string, it is said to @dfn{match} it.
|
|
|
|
@item Metacharacters
|
|
Characters used within a regexp that do not stand for themselves.
|
|
Instead, they denote regular expression operations, such as repetition,
|
|
grouping, or alternation.
|
|
|
|
@item Null String
|
|
A string with no characters in it. It is represented explicitly in
|
|
@command{awk} programs by placing two double quote characters next to
|
|
each other (@code{""}). It can appear in input data by having two successive
|
|
occurrences of the field separator appear next to each other.
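
A minimal sketch, using a made-up input line with @samp{:} as the
field separator: the second field is the null string, so its length
is zero.

@example
$ echo 'a::c' | awk -F: '@{ print NF, length($2) @}'
@print{} 3 0
@end example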
|
|
|
|
@item Number
|
|
A numeric-valued data object. Modern @command{awk} implementations use
|
|
double-precision floating-point to represent numbers.
|
|
Very old @command{awk} implementations use single-precision floating-point.
|
|
|
|
@item Octal
|
|
Base-eight notation, where the digits are @code{0}--@code{7}.
|
|
Octal numbers are written in C using a leading @samp{0},
|
|
to indicate their base. Thus, @code{013} is 11 (one times 8 plus 3).
|
|
|
|
@cindex P1003.2 POSIX standard
|
|
@item P1003.2
|
|
See ``POSIX.''
|
|
|
|
@item Pattern
|
|
Patterns tell @command{awk} which input records are interesting to which
|
|
rules.
|
|
|
|
A pattern is an arbitrary conditional expression against which input is
|
|
tested. If the condition is satisfied, the pattern is said to @dfn{match}
|
|
the input record. A typical pattern might compare the input record against
|
|
a regular expression. (@xref{Pattern Overview}.)
|
|
|
|
@item POSIX
|
|
The name for a series of standards
|
|
@c being developed by the IEEE
|
|
that specify a Portable Operating System interface. The ``IX'' denotes
|
|
the Unix heritage of these standards. The main standard of interest for
|
|
@command{awk} users is
|
|
@cite{IEEE Standard for Information Technology, Standard 1003.2-1992,
|
|
Portable Operating System Interface (POSIX) Part 2: Shell and Utilities}.
|
|
Informally, this standard is often referred to as simply ``P1003.2.''
|
|
|
|
@item Precedence
|
|
The order in which operations are performed when operators are used
|
|
without explicit parentheses.
|
|
|
|
@item Private
|
|
Variables and/or functions that are meant for use exclusively by library
|
|
functions and not for the main @command{awk} program. Special care must be
|
|
taken when naming such variables and functions.
|
|
(@xref{Library Names}.)
|
|
|
|
@item Range (of input lines)
|
|
A sequence of consecutive lines from the input file(s). A pattern
|
|
can specify ranges of input lines for @command{awk} to process or it can
|
|
specify single lines. (@xref{Pattern Overview}.)
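
As an illustration (the four-line input is made up), a range pattern
consists of two patterns separated by a comma. The range starts at a
record matching the first and ends at a record matching the second:

@example
$ printf '1\n2\n3\n4\n' | awk '/2/, /3/'
@print{} 2
@print{} 3
@end example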
|
|
|
|
@item Recursion
|
|
When a function calls itself, either directly or indirectly.
|
|
If this isn't clear, refer to the entry for ``recursion.''
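
A classic sketch is the factorial function, which calls itself
directly. The function name @code{fact} is illustrative only:

@example
$ awk 'function fact(n) @{ return (n <= 1) ? 1 : n * fact(n - 1) @}
>      BEGIN @{ print fact(5) @}'
@print{} 120
@end example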
|
|
|
|
@item Redirection
|
|
Redirection means performing input from something other than the standard input
|
|
stream, or performing output to something other than the standard output stream.
|
|
|
|
You can redirect the output of the @code{print} and @code{printf} statements
|
|
to a file or a system command, using the @samp{>}, @samp{>>}, @samp{|}, and @samp{|&}
|
|
operators. You can redirect input to the @code{getline} statement using
|
|
the @samp{<}, @samp{|}, and @samp{|&} operators.
|
|
(@xref{Redirection},
|
|
and @ref{Getline}.)
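
As a small sketch, assuming the @command{sort} utility is available,
the following sends the output of @code{print} through a pipe to
another command:

@example
$ awk 'BEGIN @{ print "b\na" | "sort" @}'
@print{} a
@print{} b
@end example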
|
|
|
|
@item Regexp
|
|
Short for @dfn{regular expression}. A regexp is a pattern that denotes a
|
|
set of strings, possibly an infinite set. For example, the regexp
|
|
@samp{R.*xp} matches any string starting with the letter @samp{R}
|
|
and ending with the letters @samp{xp}. In @command{awk}, regexps are
|
|
used in patterns and in conditional expressions. Regexps may contain
|
|
escape sequences. (@xref{Regexp}.)
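
For example, with three made-up input lines, the regexp above used as
a pattern selects only the lines containing an uppercase @samp{R}
followed later by @samp{xp}:

@example
$ printf 'Regexp\nregexp\nR.xp\n' | awk '/R.*xp/'
@print{} Regexp
@print{} R.xp
@end example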
|
|
|
|
@item Regular Expression
|
|
See ``Regexp.''
|
|
|
|
@item Regular Expression Constant
|
|
A regular expression constant is a regular expression written within
|
|
slashes, such as @code{/foo/}. This regular expression is chosen
|
|
when you write the @command{awk} program and cannot be changed during
|
|
its execution. (@xref{Regexp Usage}.)
|
|
|
|
@item Rule
|
|
A segment of an @command{awk} program that specifies how to process single
|
|
input records. A rule consists of a @dfn{pattern} and an @dfn{action}.
|
|
@command{awk} reads an input record; then, for each rule, if the input record
|
|
satisfies the rule's pattern, @command{awk} executes the rule's action.
|
|
Otherwise, the rule does nothing for that input record.
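
For instance, the one-rule program in the following sketch (the input
is made up for illustration) has the pattern @code{$1 > 1} and the
action @code{@{ print $2 @}}:

@example
$ printf '1 one\n2 two\n' | awk '$1 > 1 @{ print $2 @}'
@print{} two
@end example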
|
|
|
|
@item Rvalue
|
|
A value that can appear on the right side of an assignment operator.
|
|
In @command{awk}, essentially every expression has a value. These values
|
|
are rvalues.
|
|
|
|
@item Scalar
|
|
A single value, be it a number or a string.
|
|
Regular variables are scalars; arrays and functions are not.
|
|
|
|
@item Search Path
|
|
In @command{gawk}, a list of directories to search for @command{awk} program source files.
|
|
In the shell, a list of directories to search for executable programs.
|
|
|
|
@item Seed
|
|
The initial value, or starting point, for a sequence of random numbers.
|
|
|
|
@item @command{sed}
|
|
See ``Stream Editor.''
|
|
|
|
@item Shell
|
|
The command interpreter for Unix and POSIX-compliant systems.
|
|
The shell works both interactively, and as a programming language
|
|
for batch files, or shell scripts.
|
|
|
|
@item Short-Circuit
|
|
The nature of the @command{awk} logical operators @samp{&&} and @samp{||}.
|
|
If the value of the entire expression is determinable from evaluating just
|
|
the lefthand side of these operators, the righthand side is not
|
|
evaluated.
|
|
(@xref{Boolean Ops}.)
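
A common use, sketched here with an illustrative variable @code{n},
is guarding an operation that would otherwise fail. Because the
lefthand side is false, the division is never attempted:

@example
$ awk 'BEGIN @{ n = 0
>     if (n != 0 && 10 / n > 1) print "big"; else print "skipped" @}'
@print{} skipped
@end example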
|
|
|
|
@item Side Effect
|
|
A side effect occurs when an expression has an effect aside from merely
|
|
producing a value. Assignment expressions, increment and decrement
|
|
expressions, and function calls have side effects.
|
|
(@xref{Assignment Ops}.)
|
|
|
|
@item Single-Precision
|
|
An internal representation of numbers that can have fractional parts.
|
|
Single-precision numbers keep track of fewer digits than do double-precision
|
|
numbers, but operations on them are sometimes less expensive in terms of CPU time.
|
|
This is the type used by some very old versions of @command{awk} to store
|
|
numeric values. It is the C type @code{float}.
|
|
|
|
@item Space
|
|
The character generated by hitting the space bar on the keyboard.
|
|
|
|
@item Special File
|
|
A @value{FN} interpreted internally by @command{gawk}, instead of being handed
|
|
directly to the underlying operating system---for example, @file{/dev/stderr}.
|
|
(@xref{Special Files}.)
|
|
|
|
@item Stream Editor
|
|
A program that reads records from an input stream and processes them one
|
|
or more at a time. This is in contrast with batch programs, which may
|
|
expect to read their input files in entirety before starting to do
|
|
anything, as well as with interactive programs which require input from the
|
|
user.
|
|
|
|
@item String
|
|
A datum consisting of a sequence of characters, such as @samp{I am a
|
|
string}. Constant strings are written with double quotes in the
|
|
@command{awk} language and may contain escape sequences.
|
|
(@xref{Escape Sequences}.)
|
|
|
|
@item Tab
|
|
The character generated by hitting the @kbd{TAB} key on the keyboard.
|
|
It usually expands to up to eight spaces upon output.
|
|
|
|
@item Text Domain
|
|
A unique name that identifies an application.
|
|
Used for grouping messages that are translated at runtime
|
|
into the local language.
|
|
|
|
@item Timestamp
|
|
A value in the ``seconds since the epoch'' format used by Unix
|
|
and POSIX systems. Used for the @command{gawk} functions
|
|
@code{mktime}, @code{strftime}, and @code{systime}.
|
|
See also ``Epoch'' and ``UTC.''
|
|
|
|
@cindex Linux
|
|
@cindex GNU/Linux
|
|
@cindex Unix
|
|
@cindex BSD-based operating systems
|
|
@cindex NetBSD
|
|
@cindex FreeBSD
|
|
@cindex OpenBSD
|
|
@item Unix
|
|
A computer operating system originally developed in the early 1970's at
|
|
AT&T Bell Laboratories. It initially became popular in universities around
|
|
the world and later moved into commercial environments as a software
|
|
development system and network server system. There are many commercial
|
|
versions of Unix, as well as several work-alike systems whose source code
|
|
is freely available (such as GNU/Linux, NetBSD, FreeBSD, and OpenBSD).
|
|
|
|
@item UTC
|
|
The accepted abbreviation for ``Coordinated Universal Time.''
|
|
This is standard time in Greenwich, England, which is used as a
|
|
reference time for day and date calculations.
|
|
See also ``Epoch'' and ``GMT.''
|
|
|
|
@item Whitespace
|
|
A sequence of space, TAB, or newline characters occurring inside an input
|
|
record or a string.
|
|
@end table
|
|
|
|
@node Copying
|
|
@unnumbered GNU General Public License
|
|
@center Version 2, June 1991
|
|
|
|
@display
|
|
Copyright @copyright{} 1989, 1991 Free Software Foundation, Inc.
|
|
59 Temple Place, Suite 330, Boston, MA 02111, USA
|
|
|
|
Everyone is permitted to copy and distribute verbatim copies
|
|
of this license document, but changing it is not allowed.
|
|
@end display
|
|
|
|
@c fakenode --- for prepinfo
|
|
@unnumberedsec Preamble
|
|
|
|
The licenses for most software are designed to take away your
|
|
freedom to share and change it. By contrast, the GNU General Public
|
|
License is intended to guarantee your freedom to share and change free
|
|
software---to make sure the software is free for all its users. This
|
|
General Public License applies to most of the Free Software
|
|
Foundation's software and to any other program whose authors commit to
|
|
using it. (Some other Free Software Foundation software is covered by
|
|
the GNU Library General Public License instead.) You can apply it to
|
|
your programs, too.
|
|
|
|
When we speak of free software, we are referring to freedom, not
|
|
price. Our General Public Licenses are designed to make sure that you
|
|
have the freedom to distribute copies of free software (and charge for
|
|
this service if you wish), that you receive source code or can get it
|
|
if you want it, that you can change the software or use pieces of it
|
|
in new free programs; and that you know you can do these things.
|
|
|
|
To protect your rights, we need to make restrictions that forbid
|
|
anyone to deny you these rights or to ask you to surrender the rights.
|
|
These restrictions translate to certain responsibilities for you if you
|
|
distribute copies of the software, or if you modify it.
|
|
|
|
For example, if you distribute copies of such a program, whether
|
|
gratis or for a fee, you must give the recipients all the rights that
|
|
you have. You must make sure that they, too, receive or can get the
|
|
source code. And you must show them these terms so they know their
|
|
rights.
|
|
|
|
We protect your rights with two steps: (1) copyright the software, and
|
|
(2) offer you this license which gives you legal permission to copy,
|
|
distribute and/or modify the software.
|
|
|
|
Also, for each author's protection and ours, we want to make certain
|
|
that everyone understands that there is no warranty for this free
|
|
software. If the software is modified by someone else and passed on, we
|
|
want its recipients to know that what they have is not the original, so
|
|
that any problems introduced by others will not reflect on the original
|
|
authors' reputations.
|
|
|
|
Finally, any free program is threatened constantly by software
|
|
patents. We wish to avoid the danger that redistributors of a free
|
|
program will individually obtain patent licenses, in effect making the
|
|
program proprietary. To prevent this, we have made it clear that any
|
|
patent must be licensed for everyone's free use or not licensed at all.
|
|
|
|
The precise terms and conditions for copying, distribution and
|
|
modification follow.
|
|
|
|
@ifnotinfo
|
|
@c fakenode --- for prepinfo
|
|
@unnumberedsec Terms and Conditions for Copying, Distribution and Modification
|
|
@end ifnotinfo
|
|
@ifinfo
|
|
@center TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
|
|
@end ifinfo
|
|
|
|
@enumerate 0
|
|
@item
|
|
This License applies to any program or other work which contains
|
|
a notice placed by the copyright holder saying it may be distributed
|
|
under the terms of this General Public License. The ``Program'', below,
|
|
refers to any such program or work, and a ``work based on the Program''
|
|
means either the Program or any derivative work under copyright law:
|
|
that is to say, a work containing the Program or a portion of it,
|
|
either verbatim or with modifications and/or translated into another
|
|
language. (Hereinafter, translation is included without limitation in
|
|
the term ``modification''.) Each licensee is addressed as ``you''.
|
|
|
|
Activities other than copying, distribution and modification are not
|
|
covered by this License; they are outside its scope. The act of
|
|
running the Program is not restricted, and the output from the Program
|
|
is covered only if its contents constitute a work based on the
|
|
Program (independent of having been made by running the Program).
|
|
Whether that is true depends on what the Program does.
|
|
|
|
@item
|
|
You may copy and distribute verbatim copies of the Program's
|
|
source code as you receive it, in any medium, provided that you
|
|
conspicuously and appropriately publish on each copy an appropriate
|
|
copyright notice and disclaimer of warranty; keep intact all the
|
|
notices that refer to this License and to the absence of any warranty;
|
|
and give any other recipients of the Program a copy of this License
|
|
along with the Program.
|
|
|
|
You may charge a fee for the physical act of transferring a copy, and
|
|
you may at your option offer warranty protection in exchange for a fee.
|
|
|
|
@item
|
|
You may modify your copy or copies of the Program or any portion
|
|
of it, thus forming a work based on the Program, and copy and
|
|
distribute such modifications or work under the terms of Section 1
|
|
above, provided that you also meet all of these conditions:
|
|
|
|
@enumerate a
|
|
@item
|
|
You must cause the modified files to carry prominent notices
|
|
stating that you changed the files and the date of any change.
|
|
|
|
@item
|
|
You must cause any work that you distribute or publish, that in
|
|
whole or in part contains or is derived from the Program or any
|
|
part thereof, to be licensed as a whole at no charge to all third
|
|
parties under the terms of this License.
|
|
|
|
@item
|
|
If the modified program normally reads commands interactively
|
|
when run, you must cause it, when started running for such
|
|
interactive use in the most ordinary way, to print or display an
|
|
announcement including an appropriate copyright notice and a
|
|
notice that there is no warranty (or else, saying that you provide
|
|
a warranty) and that users may redistribute the program under
|
|
these conditions, and telling the user how to view a copy of this
|
|
License. (Exception: if the Program itself is interactive but
|
|
does not normally print such an announcement, your work based on
|
|
the Program is not required to print an announcement.)
|
|
@end enumerate
|
|
|
|
These requirements apply to the modified work as a whole. If
|
|
identifiable sections of that work are not derived from the Program,
|
|
and can be reasonably considered independent and separate works in
|
|
themselves, then this License, and its terms, do not apply to those
|
|
sections when you distribute them as separate works. But when you
|
|
distribute the same sections as part of a whole which is a work based
|
|
on the Program, the distribution of the whole must be on the terms of
|
|
this License, whose permissions for other licensees extend to the
|
|
entire whole, and thus to each and every part regardless of who wrote it.
|
|
|
|
Thus, it is not the intent of this section to claim rights or contest
|
|
your rights to work written entirely by you; rather, the intent is to
|
|
exercise the right to control the distribution of derivative or
|
|
collective works based on the Program.
|
|
|
|
In addition, mere aggregation of another work not based on the Program
|
|
with the Program (or with a work based on the Program) on a volume of
|
|
a storage or distribution medium does not bring the other work under
|
|
the scope of this License.
|
|
|
|
@item
|
|
You may copy and distribute the Program (or a work based on it,
|
|
under Section 2) in object code or executable form under the terms of
|
|
Sections 1 and 2 above provided that you also do one of the following:
|
|
|
|
@enumerate a
|
|
@item
|
|
Accompany it with the complete corresponding machine-readable
|
|
source code, which must be distributed under the terms of Sections
|
|
1 and 2 above on a medium customarily used for software interchange; or,
|
|
|
|
@item
|
|
Accompany it with a written offer, valid for at least three
|
|
years, to give any third party, for a charge no more than your
|
|
cost of physically performing source distribution, a complete
|
|
machine-readable copy of the corresponding source code, to be
|
|
distributed under the terms of Sections 1 and 2 above on a medium
|
|
customarily used for software interchange; or,
|
|
|
|
@item
|
|
Accompany it with the information you received as to the offer
|
|
to distribute corresponding source code. (This alternative is
|
|
allowed only for noncommercial distribution and only if you
|
|
received the program in object code or executable form with such
|
|
an offer, in accord with Subsection b above.)
|
|
@end enumerate
|
|
|
|
The source code for a work means the preferred form of the work for
|
|
making modifications to it. For an executable work, complete source
|
|
code means all the source code for all modules it contains, plus any
|
|
associated interface definition files, plus the scripts used to
|
|
control compilation and installation of the executable. However, as a
|
|
special exception, the source code distributed need not include
|
|
anything that is normally distributed (in either source or binary
|
|
form) with the major components (compiler, kernel, and so on) of the
|
|
operating system on which the executable runs, unless that component
|
|
itself accompanies the executable.
|
|
|
|
If distribution of executable or object code is made by offering
|
|
access to copy from a designated place, then offering equivalent
|
|
access to copy the source code from the same place counts as
|
|
distribution of the source code, even though third parties are not
|
|
compelled to copy the source along with the object code.
|
|
|
|
@item
|
|
You may not copy, modify, sublicense, or distribute the Program
|
|
except as expressly provided under this License. Any attempt
|
|
otherwise to copy, modify, sublicense or distribute the Program is
|
|
void, and will automatically terminate your rights under this License.
|
|
However, parties who have received copies, or rights, from you under
|
|
this License will not have their licenses terminated so long as such
|
|
parties remain in full compliance.
|
|
|
|
@item
|
|
You are not required to accept this License, since you have not
|
|
signed it. However, nothing else grants you permission to modify or
|
|
distribute the Program or its derivative works. These actions are
|
|
prohibited by law if you do not accept this License. Therefore, by
|
|
modifying or distributing the Program (or any work based on the
|
|
Program), you indicate your acceptance of this License to do so, and
|
|
all its terms and conditions for copying, distributing or modifying
|
|
the Program or works based on it.
|
|
|
|
@item
|
|
Each time you redistribute the Program (or any work based on the
|
|
Program), the recipient automatically receives a license from the
|
|
original licensor to copy, distribute or modify the Program subject to
|
|
these terms and conditions. You may not impose any further
|
|
restrictions on the recipients' exercise of the rights granted herein.
|
|
You are not responsible for enforcing compliance by third parties to
|
|
this License.
|
|
|
|
@item
|
|
If, as a consequence of a court judgment or allegation of patent
|
|
infringement or for any other reason (not limited to patent issues),
|
|
conditions are imposed on you (whether by court order, agreement or
|
|
otherwise) that contradict the conditions of this License, they do not
|
|
excuse you from the conditions of this License. If you cannot
|
|
distribute so as to satisfy simultaneously your obligations under this
|
|
License and any other pertinent obligations, then as a consequence you
|
|
may not distribute the Program at all. For example, if a patent
|
|
license would not permit royalty-free redistribution of the Program by
|
|
all those who receive copies directly or indirectly through you, then
|
|
the only way you could satisfy both it and this License would be to
|
|
refrain entirely from distribution of the Program.
|
|
|
|
If any portion of this section is held invalid or unenforceable under
|
|
any particular circumstance, the balance of the section is intended to
|
|
apply and the section as a whole is intended to apply in other
|
|
circumstances.
|
|
|
|
It is not the purpose of this section to induce you to infringe any
|
|
patents or other property right claims or to contest validity of any
|
|
such claims; this section has the sole purpose of protecting the
|
|
integrity of the free software distribution system, which is
|
|
implemented by public license practices. Many people have made
|
|
generous contributions to the wide range of software distributed
|
|
through that system in reliance on consistent application of that
|
|
system; it is up to the author/donor to decide if he or she is willing
|
|
to distribute software through any other system and a licensee cannot
|
|
impose that choice.
|
|
|
|
This section is intended to make thoroughly clear what is believed to
|
|
be a consequence of the rest of this License.
|
|
|
|
@item
|
|
If the distribution and/or use of the Program is restricted in
|
|
certain countries either by patents or by copyrighted interfaces, the
|
|
original copyright holder who places the Program under this License
|
|
may add an explicit geographical distribution limitation excluding
|
|
those countries, so that distribution is permitted only in or among
|
|
countries not thus excluded. In such case, this License incorporates
|
|
the limitation as if written in the body of this License.
|
|
|
|
@item
|
|
The Free Software Foundation may publish revised and/or new versions
|
|
of the General Public License from time to time. Such new versions will
|
|
be similar in spirit to the present version, but may differ in detail to
|
|
address new problems or concerns.
|
|
|
|
Each version is given a distinguishing version number. If the Program
|
|
specifies a version number of this License which applies to it and ``any
|
|
later version'', you have the option of following the terms and conditions
|
|
either of that version or of any later version published by the Free
|
|
Software Foundation. If the Program does not specify a version number of
|
|
this License, you may choose any version ever published by the Free Software
|
|
Foundation.
|
|
|
|
@item
|
|
If you wish to incorporate parts of the Program into other free
|
|
programs whose distribution conditions are different, write to the author
|
|
to ask for permission. For software which is copyrighted by the Free
|
|
Software Foundation, write to the Free Software Foundation; we sometimes
|
|
make exceptions for this. Our decision will be guided by the two goals
|
|
of preserving the free status of all derivatives of our free software and
|
|
of promoting the sharing and reuse of software generally.
|
|
|
|
@ifnotinfo
|
|
@c fakenode --- for prepinfo
|
|
@heading NO WARRANTY
|
|
@end ifnotinfo
|
|
@ifinfo
|
|
@center NO WARRANTY
|
|
@end ifinfo
|
|
|
|
@item
|
|
BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
|
|
FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW@. EXCEPT WHEN
|
|
OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
|
|
PROVIDE THE PROGRAM ``AS IS'' WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
|
|
OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
|
|
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE@. THE ENTIRE RISK AS
|
|
TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU@. SHOULD THE
|
|
PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
|
|
REPAIR OR CORRECTION.
|
|
|
|
@item
|
|
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
|
|
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
|
|
REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
|
|
INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
|
|
OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
|
|
TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
|
|
YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
|
|
PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
|
|
POSSIBILITY OF SUCH DAMAGES.
|
|
@end enumerate
|
|
|
|
@ifnotinfo
|
|
@c fakenode --- for prepinfo
|
|
@heading END OF TERMS AND CONDITIONS
|
|
@end ifnotinfo
|
|
@ifinfo
|
|
@center END OF TERMS AND CONDITIONS
|
|
@end ifinfo
|
|
|
|
@page
|
|
@c fakenode --- for prepinfo
|
|
@unnumberedsec How to Apply These Terms to Your New Programs
|
|
|
|
If you develop a new program, and you want it to be of the greatest
|
|
possible use to the public, the best way to achieve this is to make it
|
|
free software which everyone can redistribute and change under these terms.
|
|
|
|
To do so, attach the following notices to the program. It is safest
|
|
to attach them to the start of each source file to most effectively
|
|
convey the exclusion of warranty; and each file should have at least
|
|
the ``copyright'' line and a pointer to where the full notice is found.
|
|
|
|
@smallexample
|
|
@var{one line to give the program's name and an idea of what it does.}
|
|
Copyright (C) @var{year} @var{name of author}
|
|
|
|
This program is free software; you can redistribute it and/or
|
|
modify it under the terms of the GNU General Public License
|
|
as published by the Free Software Foundation; either version 2
|
|
of the License, or (at your option) any later version.
|
|
|
|
This program is distributed in the hope that it will be useful,
|
|
but WITHOUT ANY WARRANTY; without even the implied warranty of
|
|
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE@. See the
|
|
GNU General Public License for more details.
|
|
|
|
You should have received a copy of the GNU General Public License
|
|
along with this program; if not, write to the Free Software
|
|
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111, USA.
|
|
@end smallexample
|
|
|
|
Also add information on how to contact you by electronic and paper mail.
|
|
|
|
If the program is interactive, make it output a short notice like this
|
|
when it starts in an interactive mode:
|
|
|
|
@smallexample
|
|
Gnomovision version 69, Copyright (C) @var{year} @var{name of author}
|
|
Gnomovision comes with ABSOLUTELY NO WARRANTY; for details
|
|
type `show w'. This is free software, and you are welcome
|
|
to redistribute it under certain conditions; type `show c'
|
|
for details.
|
|
@end smallexample
|
|
|
|
The hypothetical commands @samp{show w} and @samp{show c} should show
|
|
the appropriate parts of the General Public License. Of course, the
|
|
commands you use may be called something other than @samp{show w} and
|
|
@samp{show c}; they could even be mouse-clicks or menu items---whatever
|
|
suits your program.
|
|
|
|
You should also get your employer (if you work as a programmer) or your
|
|
school, if any, to sign a ``copyright disclaimer'' for the program, if
|
|
necessary. Here is a sample; alter the names:
|
|
|
|
@smallexample
|
|
@group
|
|
Yoyodyne, Inc., hereby disclaims all copyright
|
|
interest in the program `Gnomovision'
|
|
(which makes passes at compilers) written
|
|
by James Hacker.
|
|
|
|
@var{signature of Ty Coon}, 1 April 1989
|
|
Ty Coon, President of Vice
|
|
@end group
|
|
@end smallexample
|
|
|
|
This General Public License does not permit incorporating your program into
|
|
proprietary programs. If your program is a subroutine library, you may
|
|
consider it more useful to permit linking proprietary applications with the
|
|
library. If this is what you want to do, use the GNU Lesser General
|
|
Public License instead of this License.
|
|
|
|
@node GNU Free Documentation License
|
|
@unnumbered GNU Free Documentation License
|
|
|
|
@cindex FDL (Free Documentation License)
|
|
@cindex Free Documentation License (FDL)
|
|
@cindex GNU Free Documentation License
|
|
@center Version 1.2, November 2002
|
|
|
|
@display
|
|
Copyright @copyright{} 2000,2001,2002 Free Software Foundation, Inc.
|
|
59 Temple Place, Suite 330, Boston, MA 02111-1307, USA
|
|
|
|
Everyone is permitted to copy and distribute verbatim copies
|
|
of this license document, but changing it is not allowed.
|
|
@end display
|
|
|
|
@enumerate 0
|
|
@item
|
|
PREAMBLE
|
|
|
|
The purpose of this License is to make a manual, textbook, or other
|
|
functional and useful document @dfn{free} in the sense of freedom: to
|
|
assure everyone the effective freedom to copy and redistribute it,
|
|
with or without modifying it, either commercially or noncommercially.
|
|
Secondarily, this License preserves for the author and publisher a way
|
|
to get credit for their work, while not being considered responsible
|
|
for modifications made by others.
|
|
|
|
This License is a kind of ``copyleft'', which means that derivative
|
|
works of the document must themselves be free in the same sense. It
|
|
complements the GNU General Public License, which is a copyleft
|
|
license designed for free software.
|
|
|
|
We have designed this License in order to use it for manuals for free
|
|
software, because free software needs free documentation: a free
|
|
program should come with manuals providing the same freedoms that the
|
|
software does. But this License is not limited to software manuals;
|
|
it can be used for any textual work, regardless of subject matter or
|
|
whether it is published as a printed book. We recommend this License
|
|
principally for works whose purpose is instruction or reference.
|
|
|
|
@item
|
|
APPLICABILITY AND DEFINITIONS
|
|
|
|
This License applies to any manual or other work, in any medium, that
|
|
contains a notice placed by the copyright holder saying it can be
|
|
distributed under the terms of this License. Such a notice grants a
|
|
world-wide, royalty-free license, unlimited in duration, to use that
|
|
work under the conditions stated herein. The ``Document'', below,
|
|
refers to any such manual or work. Any member of the public is a
|
|
licensee, and is addressed as ``you''. You accept the license if you
|
|
copy, modify or distribute the work in a way requiring permission
|
|
under copyright law.
|
|
|
|
A ``Modified Version'' of the Document means any work containing the
|
|
Document or a portion of it, either copied verbatim, or with
|
|
modifications and/or translated into another language.
|
|
|
|
A ``Secondary Section'' is a named appendix or a front-matter section
|
|
of the Document that deals exclusively with the relationship of the
|
|
publishers or authors of the Document to the Document's overall
|
|
subject (or to related matters) and contains nothing that could fall
|
|
directly within that overall subject. (Thus, if the Document is in
|
|
part a textbook of mathematics, a Secondary Section may not explain
|
|
any mathematics.) The relationship could be a matter of historical
|
|
connection with the subject or with related matters, or of legal,
|
|
commercial, philosophical, ethical or political position regarding
|
|
them.
|
|
|
|
The ``Invariant Sections'' are certain Secondary Sections whose titles
|
|
are designated, as being those of Invariant Sections, in the notice
|
|
that says that the Document is released under this License. If a
|
|
section does not fit the above definition of Secondary then it is not
|
|
allowed to be designated as Invariant. The Document may contain zero
|
|
Invariant Sections. If the Document does not identify any Invariant
|
|
Sections then there are none.
|
|
|
|
The ``Cover Texts'' are certain short passages of text that are listed,
|
|
as Front-Cover Texts or Back-Cover Texts, in the notice that says that
|
|
the Document is released under this License. A Front-Cover Text may
|
|
be at most 5 words, and a Back-Cover Text may be at most 25 words.
|
|
|
|
A ``Transparent'' copy of the Document means a machine-readable copy,
|
|
represented in a format whose specification is available to the
|
|
general public, that is suitable for revising the document
|
|
straightforwardly with generic text editors or (for images composed of
|
|
pixels) generic paint programs or (for drawings) some widely available
|
|
drawing editor, and that is suitable for input to text formatters or
|
|
for automatic translation to a variety of formats suitable for input
|
|
to text formatters. A copy made in an otherwise Transparent file
|
|
format whose markup, or absence of markup, has been arranged to thwart
|
|
or discourage subsequent modification by readers is not Transparent.
|
|
An image format is not Transparent if used for any substantial amount
|
|
of text. A copy that is not ``Transparent'' is called ``Opaque''.
|
|
|
|
Examples of suitable formats for Transparent copies include plain
|
|
@sc{ascii} without markup, Texinfo input format, La@TeX{} input
|
|
format, @acronym{SGML} or @acronym{XML} using a publicly available
|
|
@acronym{DTD}, and standard-conforming simple @acronym{HTML},
|
|
PostScript or @acronym{PDF} designed for human modification. Examples
|
|
of transparent image formats include @acronym{PNG}, @acronym{XCF} and
|
|
@acronym{JPG}. Opaque formats include proprietary formats that can be
|
|
read and edited only by proprietary word processors, @acronym{SGML} or
|
|
@acronym{XML} for which the @acronym{DTD} and/or processing tools are
|
|
not generally available, and the machine-generated @acronym{HTML},
|
|
PostScript or @acronym{PDF} produced by some word processors for
|
|
output purposes only.
|
|
|
|
The ``Title Page'' means, for a printed book, the title page itself,
|
|
plus such following pages as are needed to hold, legibly, the material
|
|
this License requires to appear in the title page. For works in
|
|
formats which do not have any title page as such, ``Title Page'' means
|
|
the text near the most prominent appearance of the work's title,
|
|
preceding the beginning of the body of the text.
|
|
|
|
A section ``Entitled XYZ'' means a named subunit of the Document whose
|
|
title either is precisely XYZ or contains XYZ in parentheses following
|
|
text that translates XYZ in another language. (Here XYZ stands for a
|
|
specific section name mentioned below, such as ``Acknowledgements'',
|
|
``Dedications'', ``Endorsements'', or ``History''.) To ``Preserve the Title''
|
|
of such a section when you modify the Document means that it remains a
|
|
section ``Entitled XYZ'' according to this definition.
|
|
|
|
The Document may include Warranty Disclaimers next to the notice which
|
|
states that this License applies to the Document. These Warranty
|
|
Disclaimers are considered to be included by reference in this
|
|
License, but only as regards disclaiming warranties: any other
|
|
implication that these Warranty Disclaimers may have is void and has
|
|
no effect on the meaning of this License.
|
|
|
|
@item
|
|
VERBATIM COPYING
|
|
|
|
You may copy and distribute the Document in any medium, either
|
|
commercially or noncommercially, provided that this License, the
|
|
copyright notices, and the license notice saying this License applies
|
|
to the Document are reproduced in all copies, and that you add no other
|
|
conditions whatsoever to those of this License. You may not use
|
|
technical measures to obstruct or control the reading or further
|
|
copying of the copies you make or distribute. However, you may accept
|
|
compensation in exchange for copies. If you distribute a large enough
|
|
number of copies you must also follow the conditions in section 3.
|
|
|
|
You may also lend copies, under the same conditions stated above, and
|
|
you may publicly display copies.
|
|
|
|
@item
|
|
COPYING IN QUANTITY
|
|
|
|
If you publish printed copies (or copies in media that commonly have
|
|
printed covers) of the Document, numbering more than 100, and the
|
|
Document's license notice requires Cover Texts, you must enclose the
|
|
copies in covers that carry, clearly and legibly, all these Cover
|
|
Texts: Front-Cover Texts on the front cover, and Back-Cover Texts on
|
|
the back cover. Both covers must also clearly and legibly identify
|
|
you as the publisher of these copies. The front cover must present
|
|
the full title with all words of the title equally prominent and
|
|
visible. You may add other material on the covers in addition.
|
|
Copying with changes limited to the covers, as long as they preserve
|
|
the title of the Document and satisfy these conditions, can be treated
|
|
as verbatim copying in other respects.
|
|
|
|
If the required texts for either cover are too voluminous to fit
|
|
legibly, you should put the first ones listed (as many as fit
|
|
reasonably) on the actual cover, and continue the rest onto adjacent
|
|
pages.
|
|
|
|
If you publish or distribute Opaque copies of the Document numbering
|
|
more than 100, you must either include a machine-readable Transparent
|
|
copy along with each Opaque copy, or state in or with each Opaque copy
|
|
a computer-network location from which the general network-using
|
|
public has access to download using public-standard network protocols
|
|
a complete Transparent copy of the Document, free of added material.
|
|
If you use the latter option, you must take reasonably prudent steps,
|
|
when you begin distribution of Opaque copies in quantity, to ensure
|
|
that this Transparent copy will remain thus accessible at the stated
|
|
location until at least one year after the last time you distribute an
|
|
Opaque copy (directly or through your agents or retailers) of that
|
|
edition to the public.
|
|
|
|
It is requested, but not required, that you contact the authors of the
|
|
Document well before redistributing any large number of copies, to give
|
|
them a chance to provide you with an updated version of the Document.
|
|
|
|
@item
|
|
MODIFICATIONS
|
|
|
|
You may copy and distribute a Modified Version of the Document under
|
|
the conditions of sections 2 and 3 above, provided that you release
|
|
the Modified Version under precisely this License, with the Modified
|
|
Version filling the role of the Document, thus licensing distribution
|
|
and modification of the Modified Version to whoever possesses a copy
|
|
of it. In addition, you must do these things in the Modified Version:
|
|
|
|
@enumerate A
|
|
@item
|
|
Use in the Title Page (and on the covers, if any) a title distinct
|
|
from that of the Document, and from those of previous versions
|
|
(which should, if there were any, be listed in the History section
|
|
of the Document). You may use the same title as a previous version
|
|
if the original publisher of that version gives permission.
|
|
|
|
@item
|
|
List on the Title Page, as authors, one or more persons or entities
|
|
responsible for authorship of the modifications in the Modified
|
|
Version, together with at least five of the principal authors of the
|
|
Document (all of its principal authors, if it has fewer than five),
|
|
unless they release you from this requirement.
|
|
|
|
@item
|
|
State on the Title page the name of the publisher of the
|
|
Modified Version, as the publisher.
|
|
|
|
@item
|
|
Preserve all the copyright notices of the Document.
|
|
|
|
@item
|
|
Add an appropriate copyright notice for your modifications
|
|
adjacent to the other copyright notices.
|
|
|
|
@item
|
|
Include, immediately after the copyright notices, a license notice
|
|
giving the public permission to use the Modified Version under the
|
|
terms of this License, in the form shown in the Addendum below.
|
|
|
|
@item
|
|
Preserve in that license notice the full lists of Invariant Sections
|
|
and required Cover Texts given in the Document's license notice.
|
|
|
|
@item
|
|
Include an unaltered copy of this License.
|
|
|
|
@item
|
|
Preserve the section Entitled ``History'', Preserve its Title, and add
|
|
to it an item stating at least the title, year, new authors, and
|
|
publisher of the Modified Version as given on the Title Page. If
|
|
there is no section Entitled ``History'' in the Document, create one
|
|
stating the title, year, authors, and publisher of the Document as
|
|
given on its Title Page, then add an item describing the Modified
|
|
Version as stated in the previous sentence.
|
|
|
|
@item
|
|
Preserve the network location, if any, given in the Document for
|
|
public access to a Transparent copy of the Document, and likewise
|
|
the network locations given in the Document for previous versions
|
|
it was based on. These may be placed in the ``History'' section.
|
|
You may omit a network location for a work that was published at
|
|
least four years before the Document itself, or if the original
|
|
publisher of the version it refers to gives permission.
|
|
|
|
@item
|
|
For any section Entitled ``Acknowledgements'' or ``Dedications'', Preserve
|
|
the Title of the section, and preserve in the section all the
|
|
substance and tone of each of the contributor acknowledgements and/or
|
|
dedications given therein.
|
|
|
|
@item
|
|
Preserve all the Invariant Sections of the Document,
|
|
unaltered in their text and in their titles. Section numbers
|
|
or the equivalent are not considered part of the section titles.
|
|
|
|
@item
|
|
Delete any section Entitled ``Endorsements''. Such a section
|
|
may not be included in the Modified Version.
|
|
|
|
@item
|
|
Do not retitle any existing section to be Entitled ``Endorsements'' or
|
|
to conflict in title with any Invariant Section.
|
|
|
|
@item
|
|
Preserve any Warranty Disclaimers.
|
|
@end enumerate
|
|
|
|
If the Modified Version includes new front-matter sections or
|
|
appendices that qualify as Secondary Sections and contain no material
|
|
copied from the Document, you may at your option designate some or all
|
|
of these sections as invariant. To do this, add their titles to the
|
|
list of Invariant Sections in the Modified Version's license notice.
|
|
These titles must be distinct from any other section titles.
|
|
|
|
You may add a section Entitled ``Endorsements'', provided it contains
|
|
nothing but endorsements of your Modified Version by various
|
|
parties---for example, statements of peer review or that the text has
|
|
been approved by an organization as the authoritative definition of a
|
|
standard.
|
|
|
|
You may add a passage of up to five words as a Front-Cover Text, and a
|
|
passage of up to 25 words as a Back-Cover Text, to the end of the list
|
|
of Cover Texts in the Modified Version. Only one passage of
|
|
Front-Cover Text and one of Back-Cover Text may be added by (or
|
|
through arrangements made by) any one entity. If the Document already
|
|
includes a cover text for the same cover, previously added by you or
|
|
by arrangement made by the same entity you are acting on behalf of,
|
|
you may not add another; but you may replace the old one, on explicit
|
|
permission from the previous publisher that added the old one.
|
|
|
|
The author(s) and publisher(s) of the Document do not by this License
|
|
give permission to use their names for publicity for or to assert or
|
|
imply endorsement of any Modified Version.
|
|
|
|
@item
|
|
COMBINING DOCUMENTS
|
|
|
|
You may combine the Document with other documents released under this
|
|
License, under the terms defined in section 4 above for modified
|
|
versions, provided that you include in the combination all of the
|
|
Invariant Sections of all of the original documents, unmodified, and
|
|
list them all as Invariant Sections of your combined work in its
|
|
license notice, and that you preserve all their Warranty Disclaimers.
|
|
|
|
The combined work need only contain one copy of this License, and
|
|
multiple identical Invariant Sections may be replaced with a single
|
|
copy. If there are multiple Invariant Sections with the same name but
|
|
different contents, make the title of each such section unique by
|
|
adding at the end of it, in parentheses, the name of the original
|
|
author or publisher of that section if known, or else a unique number.
|
|
Make the same adjustment to the section titles in the list of
|
|
Invariant Sections in the license notice of the combined work.
|
|
|
|
In the combination, you must combine any sections Entitled ``History''
|
|
in the various original documents, forming one section Entitled
|
|
``History''; likewise combine any sections Entitled ``Acknowledgements'',
|
|
and any sections Entitled ``Dedications''. You must delete all
|
|
sections Entitled ``Endorsements.''
|
|
|
|
@item
|
|
COLLECTIONS OF DOCUMENTS
|
|
|
|
You may make a collection consisting of the Document and other documents
|
|
released under this License, and replace the individual copies of this
|
|
License in the various documents with a single copy that is included in
|
|
the collection, provided that you follow the rules of this License for
|
|
verbatim copying of each of the documents in all other respects.
|
|
|
|
You may extract a single document from such a collection, and distribute
|
|
it individually under this License, provided you insert a copy of this
|
|
License into the extracted document, and follow this License in all
|
|
other respects regarding verbatim copying of that document.
|
|
|
|
@item
|
|
AGGREGATION WITH INDEPENDENT WORKS
|
|
|
|
A compilation of the Document or its derivatives with other separate
|
|
and independent documents or works, in or on a volume of a storage or
|
|
distribution medium, is called an ``aggregate'' if the copyright
|
|
resulting from the compilation is not used to limit the legal rights
|
|
of the compilation's users beyond what the individual works permit.
|
|
When the Document is included in an aggregate, this License does not
|
|
apply to the other works in the aggregate which are not themselves
|
|
derivative works of the Document.
|
|
|
|
If the Cover Text requirement of section 3 is applicable to these
|
|
copies of the Document, then if the Document is less than one half of
|
|
the entire aggregate, the Document's Cover Texts may be placed on
|
|
covers that bracket the Document within the aggregate, or the
|
|
electronic equivalent of covers if the Document is in electronic form.
|
|
Otherwise they must appear on printed covers that bracket the whole
|
|
aggregate.
|
|
|
|
@item
|
|
TRANSLATION
|
|
|
|
Translation is considered a kind of modification, so you may
|
|
distribute translations of the Document under the terms of section 4.
|
|
Replacing Invariant Sections with translations requires special
|
|
permission from their copyright holders, but you may include
|
|
translations of some or all Invariant Sections in addition to the
|
|
original versions of these Invariant Sections. You may include a
|
|
translation of this License, and all the license notices in the
|
|
Document, and any Warranty Disclaimers, provided that you also include
|
|
the original English version of this License and the original versions
|
|
of those notices and disclaimers. In case of a disagreement between
|
|
the translation and the original version of this License or a notice
|
|
or disclaimer, the original version will prevail.
|
|
|
|
If a section in the Document is Entitled ``Acknowledgements'',
|
|
``Dedications'', or ``History'', the requirement (section 4) to Preserve
|
|
its Title (section 1) will typically require changing the actual
|
|
title.
|
|
|
|
@item
TERMINATION

You may not copy, modify, sublicense, or distribute the Document except
as expressly provided for under this License.  Any other attempt to
copy, modify, sublicense or distribute the Document is void, and will
automatically terminate your rights under this License.  However,
parties who have received copies, or rights, from you under this
License will not have their licenses terminated so long as such
parties remain in full compliance.

@item
FUTURE REVISIONS OF THIS LICENSE

The Free Software Foundation may publish new, revised versions
of the GNU Free Documentation License from time to time.  Such new
versions will be similar in spirit to the present version, but may
differ in detail to address new problems or concerns.  See
@uref{http://www.gnu.org/copyleft/}.

Each version of the License is given a distinguishing version number.
If the Document specifies that a particular numbered version of this
License ``or any later version'' applies to it, you have the option of
following the terms and conditions either of that specified version or
of any later version that has been published (not as a draft) by the
Free Software Foundation.  If the Document does not specify a version
number of this License, you may choose any version ever published (not
as a draft) by the Free Software Foundation.
@end enumerate

@c fakenode --- for prepinfo
@unnumberedsec ADDENDUM: How to use this License for your documents

To use this License in a document you have written, include a copy of
the License in the document and put the following copyright and
license notices just after the title page:

@smallexample
@group
Copyright (C) @var{year} @var{your name}.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.2
or any later version published by the Free Software Foundation;
with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
A copy of the license is included in the section entitled ``GNU
Free Documentation License''.
@end group
@end smallexample

If you have Invariant Sections, Front-Cover Texts and Back-Cover Texts,
replace the ``with...Texts.'' line with this:

@smallexample
@group
with the Invariant Sections being @var{list their titles}, with
the Front-Cover Texts being @var{list}, and with the Back-Cover Texts
being @var{list}.
@end group
@end smallexample

If you have Invariant Sections without Cover Texts, or some other
combination of the three, merge those two alternatives to suit the
situation.

If your document contains nontrivial examples of program code, we
recommend releasing these examples in parallel under your choice of
free software license, such as the GNU General Public License,
to permit their use in free software.

@c Local Variables:
@c ispell-local-pdict: "ispell-dict"
@c End:


@node Index
@unnumbered Index
@printindex cp

@bye

Unresolved Issues:
------------------
1. From ADR.

Robert J. Chassell points out that awk programs should have some indication
of how to use them.  It would be useful to perhaps have a "programming
style" section of the manual that would include this and other tips.

2. The default AWKPATH search path should be configurable via `configure'
The default and how this changes needs to be documented.

Consistency issues:
/.../ regexps are in @code, not @samp
".." strings are in @code, not @samp
no @print before @dots
values of expressions in the text (@code{x} has the value 15),
should be in roman, not @code
Use TAB and not tab
Use ESC and not ESCAPE
Use space and not blank to describe the space bar's character
The term "blank" is thus basically reserved for "blank lines" etc.
To make dark corners work, the @value{DARKCORNER} has to be outside
closing `.' of a sentence and after (pxref{...}).  This is
a change from earlier versions.
" " should have an @w{} around it
Use "non-" only with language names or acronyms, or the words bug and option
Use @command{ftp} when talking about anonymous ftp
Use uppercase and lowercase, not "upper-case" and "lower-case"
or "upper case" and "lower case"
Use "single precision" and "double precision", not "single-precision" or "double-precision"
Use alphanumeric, not alpha-numeric
Use POSIX-compliant, not POSIX compliant
Use --foo, not -Wfoo when describing long options
Use "Bell Laboratories", but not "Bell Labs".
Use "behavior" instead of "behaviour".
Use "zeros" instead of "zeroes".
Use "nonzero" not "non-zero".
Use "runtime" not "run time" or "run-time".
Use "command-line" not "command line".
Use "online" not "on-line".
Use "whitespace" not "white space".
Use "Input/Output", not "input/output". Also "I/O", not "i/o".
Use "lefthand"/"righthand", not "left-hand"/"right-hand".
Use "workaround", not "work-around".
Use "startup"/"cleanup", not "start-up"/"clean-up"
Use @code{do}, and not @code{do}-@code{while}, except where
actually discussing the do-while.
Use "versus" in text and "vs." in index entries
The words "a", "and", "as", "between", "for", "from", "in", "of",
"on", "that", "the", "to", "with", and "without",
should not be capitalized in @chapter, @section etc.
"Into" and "How" should.
Search for @dfn; make sure important items are also indexed.
"e.g." should always be followed by a comma.
"i.e." should always be followed by a comma.
The numbers zero through ten should be spelled out, except when
talking about file descriptor numbers. > 10 and < 0, it's
ok to use numbers.
In tables, put command-line options in @code, while in the text,
put them in @option.
When using @strong, use "Note:" or "Caution:" with colons and
not exclamation points.  Do not surround the paragraphs
with @quotation ... @end quotation.
For most cases, do NOT put a comma before "and", "or" or "but".
But exercise taste with this rule.
Don't show the awk command with a program in quotes when it's
just the program.  I.e.

{
....
}

not
awk '{
...
}'

Do show it when showing command-line arguments, data files, etc., even
if there is no output shown.

Use numbered lists only to show a sequential series of steps.

Use @code{xxx} for the xxx operator in indexing statements, not @samp.

Date: Wed, 13 Apr 94 15:20:52 -0400
From: rms@gnu.org (Richard Stallman)
To: gnu-prog@gnu.org
Subject: A reminder: no pathnames in GNU

It's a GNU convention to use the term "file name" for the name of a
file, never "pathname".  We use the term "path" for search paths,
which are lists of file names.  Using it for a single file name as
well is potentially confusing to users.

So please check any documentation you maintain, if you think you might
have used "pathname".

Note that "file name" should be two words when it appears as ordinary
text.  It's ok as one word when it's a metasyntactic variable, though.

------------------------
ORA uses filename, thus the macro.

Suggestions:
------------
Enhance FIELDWIDTHS with some way to indicate "the rest of the record".
E.g., a length of 0 or -1 or something.  Maybe "n"?

Make FIELDWIDTHS be an array?

% Next edition:
% 1. Talk about common extensions, those in nawk, gawk, mawk
% 2. Use @code{foo} for variables and @code{foo()} for functions
% 3. Standardize the error messages from the functions and programs
%    in Chapters 12 and 13.
% 4. Nuke the BBS stuff and use something that won't be obsolete
% 5. Reorg chapters 5 & 7 like so:
%Chapter 5:
% - Constants, Variables, and Conversions
%   + Constant Expressions
%   + Using Regular Expression Constants
%   + Variables
%   + Conversion of Strings and Numbers
% - Operators
%   + Arithmetic Operators
%   + String Concatenation
%   + Assignment Expressions
%   + Increment and Decrement Operators
% - Truth Values and Conditions
%   + True and False in Awk
%   + Boolean Expressions
%   + Conditional Expressions
% - Function Calls
% - Operator Precedence
%
%Chapter 7:
% - Array Basics
%   + Introduction to Arrays
%   + Referring to an Array Element
%   + Assigning Array Elements
%   + Basic Array Example
%   + Scanning All Elements of an Array
% - The delete Statement
% - Using Numbers to Subscript Arrays
% - Using Uninitialized Variables as Subscripts
% - Multidimensional Arrays
%   + Scanning Multidimensional Arrays
% - Sorting Array Values and Indices with gawk