188 Commits

Author SHA1 Message Date
wiz
33d2a9cdc6 Sort sections. 2010-12-18 23:36:23 +00:00
christos
7e6e5c1f48 Add an 'l' style for sorting that sorts by the string length of the field. 2010-12-18 23:09:48 +00:00
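As a rough illustration of what ordering by field length means, here is a small Python sketch. It is not the sort(1) implementation; the function name `sort_by_field_length` and the whitespace field splitting are assumptions made for the example.

```python
def sort_by_field_length(lines, field=0):
    """Order lines by the character length of one whitespace-separated field."""
    def key(line):
        parts = line.split()
        # A missing field is treated as empty (length 0) in this sketch.
        return len(parts[field]) if field < len(parts) else 0
    # sorted() is stable, so fields of equal length keep their input order.
    return sorted(lines, key=key)


print(sort_by_field_length(["ccc x", "a y", "bb z"]))
```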
wiz
90abead58e Fix typo in comment. 2010-06-06 00:00:33 +00:00
dholland
fcf4d3f750 Rework previous change to fixit() to not trip on option arguments. (Noticed
by wiz.) Clarify the loop logic involved.
2010-06-05 17:46:08 +00:00
dholland
8696c1b71e fixit() needs to know the getopt options list to do its thing correctly. 2010-06-05 17:44:51 +00:00
dholland
b6360c7f71 Don't recognize "+3" after -- or after the first non-option argument.
This prevents converting "+3" into "-k4.1" in places where getopt
won't recognize it, which in turn prevents silly error messages and
lossage trying to sort files whose names begin with +. PR 43358.
2010-05-27 05:52:29 +00:00
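The rule described in this change and its two follow-ups can be sketched in Python. This is a simplified model, not the actual fixit() code: it only handles the plain "+N" form, and the list of options that take an argument (`opts_with_arg`) is a hypothetical stand-in for the real getopt string.

```python
def fixit(argv, opts_with_arg="kotTR"):
    """Rewrite old-style '+N' keys to '-k<N+1>.1', but only among the options."""
    out = []
    i = 0
    while i < len(argv):
        arg = argv[i]
        plain_plus = arg.startswith("+") and arg[1:].isdigit()
        is_option = arg.startswith("-") and arg not in ("-", "--")
        if not plain_plus and not is_option:
            # "--", "-", or the first non-option argument: getopt stops
            # here, so everything that follows is copied unchanged.
            out.extend(argv[i:])
            break
        if plain_plus:
            out.append("-k%d.1" % (int(arg[1:]) + 1))
        else:
            out.append(arg)
            # If the option letter that takes an argument ends this word,
            # the next word is its argument and must not be rewritten.
            for pos, ch in enumerate(arg[1:], start=1):
                if ch in opts_with_arg:
                    if pos == len(arg) - 1 and i + 1 < len(argv):
                        i += 1
                        out.append(argv[i])
                    break
        i += 1
    return out


print(fixit(["-n", "+3", "file"]))
print(fixit(["-o", "+3", "+1"]))
```

In the second call the first "+3" is the argument of -o and is left alone, while "+1" is still converted.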
jruoho
3ae25c77b6 RETURN VALUES -> EXIT STATUS. 2010-05-14 16:58:32 +00:00
enami
47e571f2ea Don't touch past the end of allocated region. It results in a segmentation
violation.
2010-02-05 21:58:41 +00:00
joerg
2b8a053617 Retire __SCCSID. It has only archeological value now. Also retire lint
conditional around __RCSID, lint can handle that fine.
2009-11-06 18:34:22 +00:00
dsl
43682b02ee If anyone is stupid enough to feed records longer than 8MB into sort, don't
sit in an infinite loop, instead eat memory until we have read 8 records.
2009-10-09 20:32:57 +00:00
dsl
41b3ada21c When we need to merge more than 16 files, do them in a hierarchy.
Reduces the amount of data written to temporary files.
The 3-level stack has to do a simple reduce after 4352 input files, for
a normal file sort this is 35GB of data or about 500 million records.
This needs about 50 open fd's - which should be ok.
Clearly the merge sort could process more input files in one go - speeding
up the sort, but at some point the number of input files would exceed
whatever limit was applied.
2009-10-09 20:29:43 +00:00
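The idea of merging in a hierarchy with a bounded fan-in can be sketched as follows. This Python version merges level by level over in-memory lists, which is a simplification: it does not model the 3-level stack or the temporary files mentioned above, and `merge_hierarchy` is a name invented for the example.

```python
import heapq

FAN_IN = 16  # merge at most this many sorted runs at once (figure from the message above)


def merge_hierarchy(runs, fan_in=FAN_IN):
    """Merge sorted runs in groups of fan_in until a single run remains."""
    runs = [list(r) for r in runs]
    while len(runs) > 1:
        # Each pass replaces every group of up to fan_in runs by one merged run.
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
    return runs[0] if runs else []


runs = [[i, i + 40, i + 80] for i in range(40)]  # 40 small sorted runs
print(merge_hierarchy(runs)[:5])
```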
dsl
768e6fa973 Don't give merge an empty file when we detect EOF with nothing in our
buffer.
2009-10-09 20:23:19 +00:00
dsl
8b6ec7b129 long align records written to temporary files. 2009-10-07 21:03:29 +00:00
dsl
5aa782f502 When encoding numbers, we can use all 8 bits for exponent values. 2009-10-07 21:02:57 +00:00
dsl
eab2f96cf5 Fix borked fix for sort relying on realloc() changing the buffer end.
Sorts of more than 8MB data now probably work again.
2009-09-28 20:30:01 +00:00
dsl
6458ae9cdf Move all the fopen() calls out of the record read routines into the callers.
Split the merge sort so that fsort() can pass the 'FILE *' of the temporary
files to be merged into the merge code.
Don't rely on realloc() not moving the end address of a buffer!
Rework merge sort so that it sorts pointers to 'struct mfile' and only
copies about sort record descriptors.
No functional change intended.
2009-09-26 21:16:55 +00:00
dsl
800732bfdc Fix sort -u, PR/42094 2009-09-19 16:18:00 +00:00
dsl
fe52672374 Minor tweaks to the key generation for numeric fields.
Use 1's complement for -ve numbers to avoid conditionals.
2009-09-16 20:56:38 +00:00
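A minimal sketch of why a one's-complement step helps: if a negative number's key bytes are all inverted, plain byte comparison puts negatives before positives and orders them correctly, without a separate code path per sign. The key layout below (a sign byte followed by a fixed-width magnitude) is an assumption for illustration and is not the encoding used by sort.

```python
def encode_int(n, width=8):
    """Encode n so that byte-wise comparison matches numeric order."""
    # 0xFF for negative n, 0x00 otherwise, computed without an if/else.
    mask = -(n < 0) & 0xFF
    raw = b"\x80" + abs(n).to_bytes(width, "big")
    # XOR with 0xFF is the one's complement of each byte; XOR with 0 is a no-op.
    return bytes(b ^ mask for b in raw)


values = [5, -3, 0, -100, 42, -1]
print(sorted(values, key=encode_int))
```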
dsl
1310aa04b4 Save length of key instead of relying on the weight of the record sep.
This frees a byte value to use for 'end of key' (to correctly sort
short keys) while still having a weight assigned to the field sep.
(Unless -t is given, the field sep is in the field data.)
Do reverse sorts by writing the output file in reverse order (rather
than reversing the sort - apart from merges).
All key compares are now unweighted.
For 'sort -u' mark duplicate keys during the sort and don't write
them to the output.
Use -S to mean a posix sort - where equal keys are sorted using the
raw record (rather than being kept in the original order).
For 'sort -f' (no keys) generate a key of the folded data (as for -n
-i and -d), simplifies the code and allows a 'posix' sort.
2009-09-10 22:02:40 +00:00
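The difference between the default ordering and the -S behaviour described above can be sketched in Python. The helper `order_records` and its `posix` flag are hypothetical names for the example; only the tie-breaking rule is taken from the message.

```python
def order_records(records, key, posix=False):
    """Sort records by key; optionally break ties on the raw record."""
    if posix:
        # Equal keys are ordered by comparing the whole record.
        return sorted(records, key=lambda r: (key(r), r))
    # Stable sort: equal keys stay in their original input order.
    return sorted(records, key=key)


def second_field(record):
    return record.split()[1]


recs = ["b 1", "a 1", "c 0"]
print(order_records(recs, second_field))
print(order_records(recs, second_field, posix=True))
```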
dsl
2abdfb3907 Now we have our own radix_sort() change the interface so that we pass
an array of 'RECHEADER *' and remove all the crappy stuff that was backed up
by REC_DATA_OFFSET (etc).
Also change radix_sort() to return the number of elements, soon to be used
to drop duplicate keys (for sort -u).
2009-09-05 12:00:25 +00:00
dsl
4611f32c1c Include a local copy of the sradixsort() code from libc.
Currently unchanged apart from the deletion of the 'unstable' version and
other unneeded code.
Use fldtab[0]. not fldtab-> when we are referring to the global info
in the 0th entry to emphasise that this entry is different.
fldtab[0].weights is only needed in the SINGL_FLD case - so set it there.
Re-indent a big 'if' in setfield() so that the line breaks match the
logic - which looks dubious now!
2009-09-05 09:16:18 +00:00
wiz
ea72fa6ee9 Fix pasto. 2009-08-23 15:45:08 +00:00
dsl
5166e91c70 Bring nearer to reality.
Note that -H is now ignored.
Move -S and -s (and -H) to the first list of options since they are
global ones, not ones that override the ordering rules.
2009-08-22 21:55:08 +00:00
dsl
5c6e557c4b <space> and <tab> at the start of key fields are supposed to be sorted
as if part of the data.
This is a bit fubar since we need a value that sorts before any byte value
as a key field separator - so need 257 byte values (since radixsort() doesn't
take a length for each record).
For now map '\t' to 0x01 and hope no one will notice!
2009-08-22 21:50:32 +00:00
dsl
e0846c3698 Put radixsort() and sradixsort() the correct way around. 2009-08-22 21:43:53 +00:00
dsl
f58fe5e68a Fix generation of unmasked alpha keys. 2009-08-22 21:28:55 +00:00
dsl
b36440a064 Only process each number digit once. 2009-08-22 21:19:40 +00:00
dsl
609b8532b4 Add some comments and clarifications to this impenetrable code.
When merging ensure we accurately sort records with identical keys by
file-number, otherwise a 'stable' sort won't be!
2009-08-22 15:16:50 +00:00
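The tie-breaking rule can be illustrated with a small heap-based merge. This is a Python sketch under the assumption that each input "file" is an already sorted list of (key, record) pairs; the names are invented for the example.

```python
import heapq


def stable_merge(files):
    """Merge sorted lists of (key, record); ties go to the lower file number."""
    # Heap entries are (key, file_no, index), so equal keys are ordered
    # by file number and the merge stays stable across input files.
    heap = [(recs[0][0], file_no, 0)
            for file_no, recs in enumerate(files) if recs]
    heapq.heapify(heap)
    out = []
    while heap:
        _key, file_no, idx = heapq.heappop(heap)
        out.append(files[file_no][idx][1])
        if idx + 1 < len(files[file_no]):
            next_key = files[file_no][idx + 1][0]
            heapq.heappush(heap, (next_key, file_no, idx + 1))
    return out


print(stable_merge([[(1, "a0"), (2, "b0")], [(1, "a1"), (2, "b1")]]))
```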
dsl
7b4a02befd Rework the way sort generates sort keys:
- If we generate a key, it is always sortable using memcmp()
- If we are sorting the whole record, then a weight-table must be used
  during compares.
- Major surgery to encoding of numbers to ensure unique keys for equal
  numeric values.  Reverse numerics are handled by inverting the sign.
- Case folding (-f) is handled when the sort keys are generated. No other
  code has to care at all.
- Key uniqueness (-u) is done during merge for large datasets. It only
  has to be done when writing the output file for small files.
  Since the file is in key order this is simple!
Probably fixes all of: PR/27257 PR/25551 PR/22182 PR/31095 PR/30504
PR/36816 PR/37860 PR/39308
Also PR/18614 should no longer die, but a little more work needs to be
done on the merging for very large files.
2009-08-22 10:53:28 +00:00
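The general shape of this design (fold case while building the key, compare keys as raw bytes, drop duplicates while writing output that is already in key order) can be sketched in Python. The key format here is just the record bytes, optionally upper-cased; the real key encoding is more involved and is not reproduced.

```python
def make_key(record, fold=False):
    """Build a key that can be compared as raw bytes (memcmp-style)."""
    # Case folding happens here, once, so no later code needs to know about it.
    return record.upper() if fold else record


def sort_bytes(records, fold=False, unique=False):
    ordered = sorted(records, key=lambda r: make_key(r, fold))
    out, prev = [], None
    for rec in ordered:
        k = make_key(rec, fold)
        # Output is in key order, so duplicates are always adjacent.
        if unique and k == prev:
            continue
        out.append(rec)
        prev = k
    return out


print(sort_bytes([b"b", b"A", b"a", b"B"], fold=True, unique=True))
```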
dsl
bf80c84843 Delete more unwanted/unused cruft.
Simplify logic for reading input records.
Do a merge sort whenever we have 16 partial sorted blocks.
The patient is breathing, but still carrying a lot of extra weight.
2009-08-20 06:36:25 +00:00
dsl
f155f3b8b9 The code that attempted to sort large files by sorting each chunk by the
first key byte and writing to a temp file, then sorting the records from
each temp file that had the same first key byte (and repeating for up to
4 key bytes) was a nice idea, but completely doomed to failure.
Eg PR/9308 where a 70MB file has all but one record the same and short keys.
Not only does the code not work, it is rather guaranteed to be slow.
Instead always use a merge sort for fully sorted chunks of records (each
temporary file contains one lot of sorted records).
The -H option already did this, so just rip out all the code and variables
that can't be used when -H was specified.
Further cleanup to come ...
2009-08-18 18:00:28 +00:00
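The replacement strategy (sort each chunk fully, write it to its own temporary file, then merge the files) looks roughly like this in Python. It is a sketch only: the chunk size, the function name, and the use of newline-terminated text lines are assumptions for the example.

```python
import heapq
import tempfile


def external_sort(lines, chunk_size=1000):
    """Sort newline-terminated lines via fully sorted temporary chunks."""
    temps = []
    try:
        for start in range(0, len(lines), chunk_size):
            tmp = tempfile.TemporaryFile("w+")
            temps.append(tmp)
            # Each temporary file holds one completely sorted run.
            tmp.writelines(sorted(lines[start:start + chunk_size]))
            tmp.seek(0)
        # Merge the runs; iterating a text file yields its lines in order.
        return list(heapq.merge(*temps))
    finally:
        for tmp in temps:
            tmp.close()


data = ["%03d\n" % ((i * 37) % 101) for i in range(250)]
print(external_sort(data, chunk_size=40)[:3])
```

Because every run is already sorted, an input where nearly all records share the same key is handled like any other.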
dsl
fa81e78b3d 'depth' is used for the number of bytes into the key that the pointers
reference, when we want to find the record header put the larger value
into 'hdr_off' to avoid any confusion that the code might be changing
'depth'!
There is now no need to save the original value as 'odepth' in append.c.
All in a vague attempt to make this code slightly readable.
2009-08-16 20:02:04 +00:00
dsl
9ab8b68075 Replace all uses of sizeof(TRECHEADER) with REC_DATA_OFFSET - which
is defined as offsetof(RECHEADER, data).  Delete TRECHEADER.
2009-08-16 19:53:43 +00:00
dsl
59ede5ae24 Always add an REC_D char (usually \n) as the last sort key char - we
almost always need one.
But do ADD it, instead of overwriting the last byte of the last key since
that may be requesting the other end of the sort order.
There is no need to check for space for the line after adding the key,
but we might as well check before - just to optimise that case.
This might fix some of the sort bugs - but not the one I'm looking at!
2009-08-15 21:26:32 +00:00
dsl
9987745061 Remove reference to db.h by using separate ptr+len fields for the only
structure that used it.
Pass end of keybuf area, not size to enterkey() - largely to remove a
variable whose use isn't obvious from the name!
The structure of this code sucks.
2009-08-15 18:40:01 +00:00
dsl
477a33f936 linebuf and linebuf_size are only used inside seq() - which also not
only has its own static variable, but will also extend the buffer.
Remove linebuf/size and change seq() to use a private, locally managed
buffer.
2009-08-15 16:50:29 +00:00
dsl
5e8c7b5dbd Remove the unused 'DBT *key' parameter from seq(). 2009-08-15 16:10:40 +00:00
dsl
a3b5c4400f In makeline() change 'pos' from 'char *' to 'u_char *' and remove all
the casts associated with its use.
None of the uses can possibly care about the signedness of the pointer.
2009-08-15 14:31:48 +00:00
dsl
2a0ab276a2 Ansify.
I'm looking at fixing the 'sort -n' fubars, but this code is an
impenetrable mess - which needs some fixing first!
2009-08-15 09:48:46 +00:00
lukem
c1ceae17f0 Enable WARNS=4 by default for usr.bin, except for:
awk  bdes  checknr  compile_et  error  gss  hxtool  kgetcred  kinit
	klist  ldd  less  lex  locale  login  m4  man  menuc  mk_cmds
	mklocale  msgc  openssl  rpcgen  rpcinfo  sdiff  spell  ssh
	string2key  telnet  tn3270  verify_krb5_conf  xlint
2009-04-14 22:15:16 +00:00
lukem
64d3192b1d Fix WARNS=4 issues (-Wcast-qual -Wsign-compare) 2009-04-13 11:07:59 +00:00
joerg
8929e0dce4 Don't workaround ancient macro argument limit with .Xo/.Xc. 2009-03-11 13:58:29 +00:00
christos
079a9a0235 Make -R accept numeric arguments so one can say -R '\0' to be used in
pipelines like find . -print0 | sort -R '\0'. From Anon Ymous
2008-11-08 17:11:56 +00:00
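What sorting with a NUL record separator amounts to can be shown with a short Python sketch. It models only the record splitting, not how sort parses the -R argument; `sort_nul_records` is a name chosen for the example.

```python
def sort_nul_records(data, sep=b"\0"):
    """Sort records that are terminated by sep instead of a newline."""
    records = data.split(sep)
    if records and records[-1] == b"":
        records.pop()  # input ended with a separator, so drop the empty tail
    return b"".join(rec + sep for rec in sorted(records))


# Shaped like the output of `find . -print0`.
print(sort_nul_records(b"./b\0./a\0./c\0"))
```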
lukem
98e5374ccb Remove the \n and tabs from the __COPYRIGHT() strings.
Tweak to use a consistent format.
2008-07-21 14:19:20 +00:00
martin
cd22f25e6f Move TNF licenses to 2 clause form 2008-05-02 18:11:04 +00:00
martin
ce099b4099 Remove clause 3 and 4 from TNF licenses 2008-04-28 20:22:51 +00:00
hubertf
f2799c52e5 <ctype.h> is unused. What's still needed is <sys/cdefs.h> (which is
usually included at that place anyways).

From Slava Semushin <slava.semushin@gmail.com>.
2007-02-21 20:15:17 +00:00
jdolecek
d1de60425b fix check for field order to allow .0 form in "-k 1.2,1.0"
fix provided in PR bin/25572 by Ross Patterson
2006-10-23 20:36:17 +00:00
jdolecek
a2e8970e19 when using -o into file which already exists, copy the permissions
of the original file to the new (sorted) file

addresses PR bin/26860 by Michael van Elst
2006-10-23 19:53:25 +00:00
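The behaviour described here (the new sorted file inherits the permissions of the file it replaces) can be sketched in Python for a POSIX system. The write-to-temporary-then-rename approach and the name `sort_into` are assumptions for the example, not a description of the C code.

```python
import os
import stat
import tempfile


def sort_into(path, lines):
    """Write sorted lines to path, keeping the mode of an existing file."""
    try:
        mode = stat.S_IMODE(os.stat(path).st_mode)
    except FileNotFoundError:
        mode = None  # no previous file, keep the default permissions
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as out:
        out.writelines(sorted(lines))
    if mode is not None:
        os.chmod(tmp, mode)  # copy permissions of the original file
    os.replace(tmp, path)
```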
jdolecek
bfa086e40a replace access(2) + /dev/ prefix check with lstat(2) and S_ISCHR()/S_ISBLK()
part of PR bin/26860 by Michael van Elst

while here, put output file fopen() inside the code block of the
only code path where it's actually needed, to make the logic more obvious;
and in the "stdout" case, initialize toutpath to empty string rather
than /dev/stdout, to make it clear /dev/stdout is not actually used
2006-10-23 19:39:54 +00:00
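The device check named in this change has a direct Python equivalent, shown here as a sketch. `is_device` is a hypothetical helper; `os.lstat`, `stat.S_ISCHR` and `stat.S_ISBLK` are the standard-library counterparts of the C calls mentioned above.

```python
import os
import stat


def is_device(path):
    """Return True if path is a character or block device (symlinks not followed)."""
    try:
        mode = os.lstat(path).st_mode
    except OSError:
        return False  # a path that cannot be examined is not treated as a device
    return stat.S_ISCHR(mode) or stat.S_ISBLK(mode)


print(is_device(os.curdir))
```

Checking the file type this way does not depend on the path starting with /dev/.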