sorting fixed length records/fields (large files)

Post by jrw32.. » Sat, 23 Oct 1999 04:00:00



It seems that the standard unix system sort is pretty flexible for files
with variable length records (i.e. newline terminated records) and for
records which use some character to delimit the extent of the field.

But what about fixed-length fields?  It seems that the only way to
handle them is to try to find some character which is never going to be
in any record and use that as the field delimiter and then take offsets
into the first field as your sort keys.  That's pretty awful.  Is there
a better way to do it?

What about files containing fixed length records which are not
newline-delimited?

I know I could write such sorts in perl or C or whatever, but I want to
use the standard sort because I need to sort huge files (1+ gigabytes)
and I don't want to rewrite the chunking/merging capabilities built into
the standard sort.

Is the only alternative to buy some commercial product like SyncSort?

Thanks,
John Wiersba

Sent via Deja.com http://www.deja.com/
Before you buy.

 
 
 

Post by brian_hunt.. » Sat, 23 Oct 1999 04:00:00




> It seems that the standard unix system sort is pretty flexible for
> files with variable length records (i.e. newline terminated records)
> and for records which use some character to delimit the extent of
> the field.

> But what about fixed-length fields?

If what you're asking is "how do I sort on the Nth character of a line?"
then the standard UNIX sort can do this:

        sort -o outfile +0.35 infile

which means start sorting on the 36th (35+1) character of the 1st (0+1)
field.
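
For reference, the +POS form is the older, obsolescent syntax; POSIX sort spells the same key as -k1.36 (field 1, character 36, running to the end of the line). A minimal sketch on made-up data, where the two sample lines and the zero padding are purely illustrative:

```shell
# Hypothetical sample: one leading letter plus 34 zeros of filler (35 characters),
# so the part we want to sort on starts at column 36.
pad=$(printf '%034d' 0)

# A whole-line sort would put the "a..." line first; keying on column 36
# compares "apple" with "pear" instead, so the "b..." line comes first.
sorted=$(printf '%s\n' "a${pad}pear" "b${pad}apple" | LC_ALL=C sort -k1.36)
printf '%s\n' "$sorted"
```

LC_ALL=C is set only so that the comparison is plain byte order regardless of the local collation rules.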

I recall Syncsort actually falls down on this in its "Unix sort
compatible mode", as it interprets the 0th field more rigidly. Every
UNIX I've tried this on handles it well.


 
 
 

Post by jrw32.. » Sat, 23 Oct 1999 04:00:00




>    sort -o outfile +0.35 infile
> which means start sorting on the 36th (35+1) character of the 1st
> (0+1) field.

I see this in the GNU man page now, but other man pages don't make
this explicit.  I assumed that a field stopped at the next occurrence of
a field delimiter.  So, apparently, a shorter "field" could sort after a
longer one if the field delimiter had a higher character value than the
corresponding character in the longer field!  That seems bizarre!  But
it does allow for fixed-length fields.
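
The effect can be reproduced with a small sketch. The data and the '|' delimiter are hypothetical, chosen because '|' (0x7C) sorts after every ASCII letter in the C locale; the contrast is between a key with no end position and one that stops at the end of field 1:

```shell
# Hypothetical two-line sample with '|' as the field delimiter.
data='ab|x
abc|x'

# No end position: the key runs to the end of the line, so the '|' after
# "ab" is compared against the 'c' of "abc", and the shorter field sorts last.
open_key=$(printf '%s\n' "$data" | LC_ALL=C sort -t'|' -k1)

# Key ends at field 1: "ab" is compared against "abc", so the shorter
# field sorts first.
closed_key=$(printf '%s\n' "$data" | LC_ALL=C sort -t'|' -k1,1)

printf 'open key:\n%s\nclosed key:\n%s\n' "$open_key" "$closed_key"
```

Giving the key an explicit end, as in -k1,1, is what avoids the surprise for delimited data.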

Now, what about fixed-length records?

--
John Wiersba


 
 
 

Post by Al Shark » Sat, 23 Oct 1999 04:00:00



> It seems that the standard unix system sort is pretty flexible for
> files with variable length records (i.e. newline terminated records)
> and for records which use some character to delimit the extent of
> the field.

> But what about fixed-length fields?

sort -k1.23,1.36
will sort by positions 23-36.  Ordering options can still be applied,
and multiple sort keys can be used.
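
As a rough illustration of combining keys and ordering options, here is a sketch on invented 6-column records (columns 1-3 a name, columns 4-6 a number); the record layout is an assumption made up for the example:

```shell
# Hypothetical records: name in columns 1-3, zero-padded number in columns 4-6.
# Primary key: the name, ascending. Secondary key: the number, numeric, descending.
sorted=$(printf '%s\n' bob002 amy010 bob001 amy002 |
  LC_ALL=C sort -k1.1,1.3 -k1.4,1.6nr)
printf '%s\n' "$sorted"
```

The n and r modifiers attach to the second key only, so the name still sorts in ascending byte order.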

> What about files containing fixed length records which are not
> newline-delimited?

Assuming printable ASCII data files:

fold -wNUMBER filename | sort -k1.23,1.36 | tr -d "\012" > outfile

where NUMBER is your desired width.  Fold inserts newlines, and tr
removes them afterwards.  If you have integers or packed decimal
numbers, you're out of luck with standard sort though.
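
A small worked instance of the pipeline above, using an invented input of three 4-byte records with the key in columns 1-3:

```shell
# Hypothetical input: three 4-byte records, no newlines between them.
# fold splits the stream into 4-character lines, sort orders them on
# columns 1-3, and tr strips the newlines again.
out=$(printf 'dog1cat2ant3' | fold -w 4 | LC_ALL=C sort -k1.1,1.3 | tr -d '\012')
printf '%s\n' "$out"
```

By default fold counts display columns rather than bytes, which is one reason the approach assumes printable ASCII; POSIX fold also has a -b option to count bytes.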

 
 
 

Post by jrw32.. » Tue, 26 Oct 1999 04:00:00




> sort -k1.23,1.36
> will sort by positions 23-36.  Ordering options can still be applied,
> and multiple sort keys can be used.

I was interpreting "field" to mean that the field was actually bounded
by the delimiters, whereas the actual practice appears to be to use
the delimiter only to find the start of the field.  As I mentioned in a
reply to another poster, this practice can lead to anomalous
(i.e. unexpected) behavior, but so be it.

> Assuming printable ASCII data files:

> fold -wNUMBER filename | sort -k1.23,1.36 | tr -d "\012" > outfile

> where NUMBER is your desired width.  Fold inserts newlines, and tr
> removes them afterwards.  If you have integers or packed decimal
> numbers, you're out of luck with standard sort though.

Thanks -- this will do the trick assuming no newlines in the data.
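
One way to check that assumption before running the pipeline is to count the newline bytes in the data. This is only a sketch; count_newlines is a made-up helper name, not a standard utility:

```shell
# Hypothetical helper: print the number of newline bytes read from stdin.
count_newlines() {
  # tr -cd deletes everything except newlines; wc -c counts what is left.
  n=$(LC_ALL=C tr -cd '\012' | wc -c)
  # Arithmetic expansion drops any padding some wc implementations add.
  echo $((n))
}

# Usage on invented data: one embedded newline is reported as 1.
printf 'ab\ncd' | count_newlines
```

A result of 0 on the real file would mean the fold/sort/tr round trip cannot split or merge records on a stray newline.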

--
John Wiersba
