Commit Graph

29 Commits

Author SHA1 Message Date
joerg
2b8a053617 Retire __SCCSID. It has only archeological value now. Also retire lint
conditional around __RCSID, lint can handle that fine.
2009-11-06 18:34:22 +00:00
dsl
41b3ada21c When we need to merge more than 16 files, do them in a hierarchy.
Reduces the amount of data written to temporary files.
The 3-level stack has to do a simple reduce after 4352 input files, for
a normal file sort this is 35GB of data or about 500 million records.
This needs about 50 open fd's - which should be ok.
Clearly the merge sort could process more input files in one go - speeding
up the sort, but at some point the number of input files would exceed
whatever limit was applied.
2009-10-09 20:29:43 +00:00
dsl
6458ae9cdf Move all the fopen() calls out of the record read routines into the callers.
Split the merge sort so that fsort() can pass the 'FILE *' of the temporary
files to be merged into the merge code.
Don't rely on realloc() not moving the end address of a buffer!
Rework merge sort so that it sorts pointers to 'struct mfile' and only
copies about sort record descriptors.
No functional change intended.
2009-09-26 21:16:55 +00:00
dsl
1310aa04b4 Save length of key instead of relying of the weight of the record sep.
This frees a byte value to use for 'end of key' (to correctly sort
short keys) while still having a weight assigned to the field sep.
(Unless -t is given, the field sep is in the field data.)
Do reverse sorts by writing the output file in reverse order (rather
than reversing the sort - apart from merges).
All key compares are now unweighted.
For 'sort -u' mark duplicates keys during the sort and don't write
to the output.
Use -S to mean a posix sort - where equal keys are sorted using the
raw record (rather than being kept in the original order).
For 'sort -f' (no keys) generate a key of the folded data (as for -n
-i and -d), simplifies the code and allows a 'posix' sort.
2009-09-10 22:02:40 +00:00
dsl
2abdfb3907 Now we have our own radix_sort() change the interface so that we pass
an array of 'RECHEADER *' and remove all the crappy stuff that backed up
by REC_DATA_OFFSET (etc).
Also change radix_sort() to return the number of elements, soon to be used
to drop duplicate keys (for sort -u).
2009-09-05 12:00:25 +00:00
dsl
609b8532b4 Add some comments and clarifications to this inpeneterable code.
When merging ensure we accurable sort records with identical keys by
file-number, otherwise a 'stable' sort won't be!
2009-08-22 15:16:50 +00:00
dsl
7b4a02befd Rework the way sort generates sort keys:
- If we generate a key, it is always sortable using memcmp()
- If we are sorting the whole record, then a weight-table must be used
  during compares.
- Major surgery to encoding of numbers to ensure unique keys for equal
  numeric values.  Reverse numerics are handled by inverting the sign.
- Case folding (-f) is handled when the sort keys are generated. No other
  code has to care at all.
- Key uniqueness (-u) is done during merge for large datasets. It only
  has to be done when writing the output file for small files.
  Since the file is in key order this is simple!
Probably fixes all of: PR/27257 PR/25551 PR/22182 PR/31095 PR/30504
PR/36816 PR/37860 PR/39308
Also PR/18614 should no longer die, but a little more work needs to be
done on the merging for very large files.
2009-08-22 10:53:28 +00:00
dsl
bf80c84843 Delete more unwanted/unused cruft.
Simplify logic for reading input records.
Do a merge sort whenever we have 16 partial sorted blocks.
The patient is breathing, but still carrying a lot of extra weight.
2009-08-20 06:36:25 +00:00
dsl
9ab8b68075 Replace all uses of sizeof(TRECHEADER) with REC_DATA_OFFSET - which
is defined as offsetof(RECHEADER, data).  Delete TRECHEADER.
2009-08-16 19:53:43 +00:00
dsl
477a33f936 linebuf and linebuf_size are only used inside seq() - which also not
only has its own static variable, but will also extend the buffer.
Remove linebuf/size and change seq() to use a private, locally managed
buffer.
2009-08-15 16:50:29 +00:00
dsl
2a0ab276a2 Ansify.
I'm looking at fixing the 'sort -n' fubars, but this code is an
inpeneterable mess - which needs some fixing first!
2009-08-15 09:48:46 +00:00
martin
ce099b4099 Remove clause 3 and 4 from TNF licenses 2008-04-28 20:22:51 +00:00
jdolecek
bf96399c09 initialize malloc()ated memory 2004-02-17 19:09:36 +00:00
itojun
ba02466ff9 KNF (mostly whitespace) 2003-10-18 03:03:20 +00:00
itojun
50847da5c5 safer use of realloc 2003-10-16 06:56:17 +00:00
jdolecek
f84513a754 add TNF copyright 2003-08-07 11:32:34 +00:00
agc
89aaa1bb64 Move UCB-licensed code from 4-clause to 3-clause licence.
Patches provided by Joel Baker in PR 22365, verified by myself.
2003-08-07 11:13:06 +00:00
jdolecek
8cf8af1c13 get rid of one memmove() (not very significant)
remove ()'s from error messages
move some error checks immediatelly after appropriate realloc() calls
2003-03-20 16:13:03 +00:00
jdolecek
63ae9a5e5f make function merge() static in msort.c
cosmetic change to how local variable is incremented (moved to for(;;))
2002-12-25 21:19:15 +00:00
jdolecek
affba8f2e9 Pull up various cosmetic (mostly whitespace) changes from OpenBSD.
This is primarily to ease syncing the two versions.
2001-02-19 20:50:17 +00:00
jdolecek
f65ee1b182 merge(): use array of buffers instead of one big buffer for all records, and
enlarge them as necessary to read records from merged files; the buffers
	are allocated once per program run, so there shouldn't be any
	performance difference
This makes sort(1) pass also regression 40B and should make it
fully arbitrary long record capable.
XXX the buffer array could probably be freed on end of fmerge() to save memory
2001-01-19 10:50:31 +00:00
jdolecek
f2deab8a4c when merging stuff from several files, make merge handle records correctly
for stable sort so that the records are not swapped arbitrarily - this makes
in-tree BSD sort(1) pass regression test 38

while here, do couple of cleanups, like s/16/MERGE_FNUM/ where appropriate,
making local stuff static and some intendation/code format changes
2001-01-13 10:33:30 +00:00
jdolecek
1c216f18ea general cleanup of file list passing:
* get rid of union f_handle, replace by passing explicit int parameter
  and (new) struct filelist
* add new typedefs gen_func_t and put_func_t and use where appropriate
2001-01-11 14:05:24 +00:00
jdolecek
fe7f0860c0 order(): since getline()/getnext() behaviour wrt passed
end pointer has changed (full buffer is used instead of first DEFLLEN bytes)
the end pointer cannot be shared for crec and prec, we need to pass
different value in each case
2000-10-17 15:16:27 +00:00
jdolecek
ed1fd80898 constify, rename MAXLLEN to DEFLLEN 2000-10-16 21:42:21 +00:00
jdolecek
681fb9cb36 don't use register declarations 2000-10-15 20:46:33 +00:00
bjh21
e5218d1719 Two classes of changes from the initial OpenBSD commit of this sort(1):
FILE * variables are called "fp" rather than "fd".
Better (safer) temporary-file handling.
2000-10-07 20:37:06 +00:00
bjh21
6029888a3a Hit sort(1) with a hammer till it compiles.
Also add RCSIDs.
2000-10-07 18:37:09 +00:00
bjh21
1d5d9b5b60 4.4BSD-Lite2 contrib/sort 2000-10-07 16:39:34 +00:00