unix sort question

Post by mark_f_edwa.. » Tue, 22 Jun 1999 04:00:00



hello all...

does anyone know the sort command syntax if you just want to sort by
column, not field, on an HP-UX box?

say I wanted to sort columns 150 through 155, and then columns 25 to,
say, 27.

thank you all very much,



 
 
 

unix sort question

Post by KG » Tue, 22 Jun 1999 04:00:00


I hate to suggest this, but do a 'man sort'.  It will be the best help.
Sort is a great tool in Unix, but I can never remember the syntax either.
I have a great Unix book that I always have to look in to do it.  Sorry
about this suggestion; I'm not trying to be funny.
--KG

> hello all...

> does anyone know the sort command syntax if you just want to sort by
> column, not field, on an HP-UX box?

> say I wanted to sort columns 150 through 155, and then columns 25 to,
> say, 27.

> thank you all very much,




 
 
 

unix sort question

Post by Kurt J. Lanz » Tue, 22 Jun 1999 04:00:00



> hello all...

> does anyone know the sort command syntax if you just want to sort by
> column, not field, on an HP-UX box?

> say I wanted to sort columns 150 through 155, and then columns 25 to,
> say, 27.

> thank you all very much,

Read the man page. Set the field delimiter to an unused character and
sort on selected parts of the first (only) field.
 
 
 

unix sort question

Post by Barry Margoli » Tue, 22 Jun 1999 04:00:00





>> hello all...

>> does anyone know the sort command syntax if you just want to sort by
>> column, not field, on an HP-UX box?

>> say I wanted to sort columns 150 through 155, and then columns 25 to,
>> say, 27.

>> thank you all very much,

>Read the man page. Set the field delimiter to an unused character and
>sort on selected parts of the first (only) field.

I've never found the need to set the field delimiter when I do this.  If
the character position is larger than the length of the first field, it
just keeps on going into other fields.

I just tried this in Solaris 2.6 and it still works like this.
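
So on the HP-UX box something along these lines ought to work
(untested there; check sort(1) for which key syntax your sort
accepts).  With the whole line effectively treated as one field,
character positions can name the columns directly.  Old-style,
zero-based positions:

   sort +0.149 -0.155 +0.24 -0.27 file

or the POSIX-style, one-based equivalent:

   sort -k 1.150,1.155 -k 1.25,1.27 file

Either way this sorts on columns 150-155 first, then on columns 25-27.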

--

GTE Internetworking, Powered by BBN, Burlington, MA
*** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups.
Please DON'T copy followups to me -- I'll assume it wasn't posted to the group.

 
 
 

unix sort question

Post by ryan_snodgr.. » Wed, 23 Jun 1999 04:00:00


I have a question about unix sort performance.  I am sorting a huge
file that has tens of millions of lines and was wondering how to
improve performance.  Here is what I am running right now:

sort -T /disk1 -y -o /disk2/output.txt /disk2/input.txt

This means that input.txt is read from disk2 and all the temporary sort
files are built and merged on disk1 and then the results are written
back to disk2.

Has anyone used the -z buffer_size option?  How does this impact
performance?  How does using -o output_file change versus just piping?
How about specifying a size with the -y flag instead of just using the
built-in maximum size or regular defaults?

Thanks for any help!

Ryan


 
 
 

unix sort question

Post by Dav » Thu, 24 Jun 1999 04:00:00


Do you have a third disk?
Are these SCSI devices? Are they all on the same controller? Spreading
things across controllers may help.  I would try to output to a third
disk to minimize contention between input.txt and output.txt.

Is the data totally random in your large file?  If it is partially
sorted according to some sort of definable logic, it may be faster to
write and compile your own sort routine based on the condition of the
data before sorting.

Can you distribute the task?
(i.e. grep out a-e and sort on one unix box, f-k on a second box, etc.
You could just cat everything together at the end.)

Do you have a multiprocessor box? That may help, too.

Naturally, you'll need to have a lot of memory...

Just some random thoughts,
good luck,
Dave


> I have a question about unix sort performance.  I am sorting a huge
> file that has tens of millions of lines and was wondering how to
> improve performance.  Here is what I am running right now:

> sort -T /disk1 -y -o /disk2/output.txt /disk2/input.txt

> This means that input.txt is read from disk2 and all the temporary sort
> files are built and merged on disk1 and then the results are written
> back to disk2.

> Has anyone used the -z buffer_size option?  How does this impact
> performance?  How does using -o output_file change versus just piping?
> How about specifying a size with the -y flag instead of just using the
> built-in maximum size or regular defaults?

> Thanks for any help!

> Ryan


 
 
 

unix sort question

Post by Ken Pizzi » Thu, 24 Jun 1999 04:00:00



>sort -T /disk1 -y -o /disk2/output.txt /disk2/input.txt
...
>Has anyone used the -z buffer_size option?  How does this impact
>performance?

The sort on my system has no -y flag, and its -z option has a
different meaning; you may want to mention what version of sort
(including vendor) you're using.

>How does using -o output_file change versus just piping?

The difference in cost of "sort -o outfile infile" vs.
"sort infile >outfile" is close to zero.  The only reason
sort has the "-o" option is so that the infile and outfile
can be one-and-the-same ---  i.e., "sort infile >infile"
is just a strange way of creating an empty file, but
"sort -o infile infile" works as desired.

But if you really meant "piping", then it depends on what
you're doing in that pipe.  It is cheaper to:
   sort foo | grep bar >sorted-bar-of-foo
than to:
   sort foo >tmp; grep bar tmp >sorted-bar-of-foo; rm tmp
(then again, it is cheaper still to:
   grep bar foo | sort >sorted-bar-of-foo
); on the other hand it is cheaper to:
   sort foo >bar
than it is to:
   sort foo | cat >bar

                --Ken Pizzini

 
 
 

unix sort question

Post by Ken Pizzi » Thu, 24 Jun 1999 04:00:00



>Do you have a third disk?
>Are these SCSI devices? Are they all on the same controller? Spreading
>things across controllers may help.  I would try to output to a third
>disk to minimize contention between input.txt and output.txt.

The input and output files will not have any disk contention,
as the input file will have been completely read and closed
before any output is created.  (Sort can't know that it's seen
what will be the first output record until it has read in all
of the input records.  This is true even when sort doesn't need
to use temp files.)

(Having separate controllers for the input/output file(s)
and for the temp files is still a good suggestion.)

                --Ken Pizzini


>> I have a question about unix sort performance.  I am sorting a huge
>> file that has tens of millions of lines and was wondering how to
>> improve performance.  Here is what I am running right now:

>> sort -T /disk1 -y -o /disk2/output.txt /disk2/input.txt

>> This means that input.txt is read from disk2 and all the temporary sort
>> files are built and merged on disk1 and then the results are written
>> back to disk2.

 
 
 

unix sort question

Post by LJ » Thu, 24 Jun 1999 04:00:00


Hi,

If you do this often, it would help to get a commercial product like
Syncsort, which does a lot of optimization for you.
Even any COBOL compiler helps; that's what I used, and I had very good
performance with it.


> I have a question about unix sort performance.  I am sorting a huge
> file that has tens of millions of lines and was wondering how to
> improve performance.  Here is what I am running right now:

> sort -T /disk1 -y -o /disk2/output.txt /disk2/input.txt

> This means that input.txt is read from disk2 and all the temporary sort
> files are built and merged on disk1 and then the results are written
> back to disk2.

> Has anyone used the -z buffer_size option?  How does this impact
> performance?  How does using -o output_file change versus just piping?
> How about specifying a size with the -y flag instead of just using the
> built-in maximum size or regular defaults?

> Thanks for any help!

> Ryan


 
 
 

unix sort question

Post by Dav » Fri, 25 Jun 1999 04:00:00


good point.
Thanks Ken

> >Do you have a third disk?
> >Are these SCSI devices? Are they all on the same controller? Spreading
> >things across controllers may help.  I would try to output to a third
> >disk to minimize contention between input.txt and output.txt.

> The input and output files will not have any disk contention,
> as the input file will have been completely read and closed
> before any output is created.  (Sort can't know that it's seen
> what will be the first output record until it has read in all
> of the input records.  This is true even when sort doesn't need
> to use temp files.)

 
 
 

unix sort question

Post by ryan_snodgr.. » Sat, 26 Jun 1999 04:00:00


Ok, so really the input and output files can be on the same controller
AND disk since the input file will be completely read and written to
tmp (or memory) and then closed before any output is written to the
output file?  So basically you want to have the input/output files on
the same disk and controller while the temp directory is on another
controller/disk?

Thanks!
Ryan



> good point.
> Thanks Ken

> > >Do you have a third disk?
> > >Are these SCSI devices? Are they all on the same controller? Spreading
> > >things across controllers may help.  I would try to output to a third
> > >disk to minimize contention between input.txt and output.txt.

> > The input and output files will not have any disk contention,
> > as the input file will have been completely read and closed
> > before any output is created.  (Sort can't know that it's seen
> > what will be the first output record until it has read in all
> > of the input records.  This is true even when sort doesn't need
> > to use temp files.)

> +---------------------+---------------+


> +---------------------+---------------+

 
 
 

unix sort question

Post by Ken Pizzi » Sat, 26 Jun 1999 04:00:00



>Ok, so really the input and output files can be on the same controller
>AND disk since the input file will be completely read and written to
>tmp (or memory) and then closed before any output is written to the
>output file?  So basically you want to have the input/output files on
>the same disk and controller while the temp directory is on another
>controller/disk?

It is okay for the input/output files to be either on the same
or on different disks.  But for minimum I/O contention you will
want the temp directory to be on both a controller and disk
distinct from both the input and the output files.
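In Ryan's example, pointing -T at /disk1 already does this, provided
/disk1 and /disk2 are separate disks on separate controllers.  If they
are not, a sketch of the adjusted command (with /disk3 a hypothetical
third disk, and -y as in his vendor's sort):

   sort -T /disk3 -y -o /disk2/output.txt /disk2/input.txt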

                --Ken Pizzini

 
 
 

unix sort question

Post by Dav » Mon, 28 Jun 1999 04:00:00


Ryan,
This may be of interest to you for your sorting problem.  I don't know
how big the file you're trying to sort is, or what assumptions you can
make about the data in it.  Below is a test I put together for the
following assumption:
* each line begins with a letter between a and z, lower case.  This
can easily be expanded to cover numbers, caps, punctuation, etc.
Just be sure you understand the collating order of the character
set and locale you are sorting in.

ARGUMENTS
* The length of time to perform a sorting task generally varies with
the second order of the number of items to be sorted.  This is often
because multiple passes need to be made over the data for comparison
purposes.  (I'm speaking generally here.)
* The time to perform a search, however, generally increases linearly
(first order) with the number of items present.

Hypothesis:
If I could grep a large file repeatedly into several smaller files,
and then sort them asynchronously, I could cat the smaller files back
(in order) to recreate a final sorted file.  For very large files,
this approach might be faster because the increased time for greps to
be completed might be more than offset by the quadratically increasing
time cost of sorting the single large file.

CREATING A LARGE TEST TEXT FILE:
I put together a little test on my Sun Ultra 10 box. First, I created
a *big* text file, using find:
# find / -name "*" > /tmp/base

Then, I made it bigger by running the following nawk one-liner,
which prepends a random lower-case letter (a-z, ASCII 97-122) to
every line:
# cd /tmp
# nawk '{ n = int(26*rand()) + 97; printf("%c%s\n", n, $0) }' base >> base1
  (run several times)
# mv /tmp/base1 /tmp/base
# ls -l /tmp
-rw-r--r--   1 root     other    10207716 Jun 26 22:56 base
{other junk}

The file looks like this:
# more base
mlost+found
eusr
husr/lost+found
nusr/platform
xusr/platform/TSBW,8000
eusr/platform/TSBW,Ultra-2i
rusr/platform/sun4u
fusr/platform/sun4u/lib
musr/platform/sun4u/lib/adb
dusr/platform/sun4u/lib/adb/sparcv9
cusr/platform/sun4u/lib/adb/sparcv9/adaptive_mutex
.
.
. (and so on)

Basically, I've created a 10+ MB text file, where I know that each
line begins with a lower-case character between a and z.

THE BASELINE TEST:
Next, we establish a baseline for the test on my box... (YMMV):
# time sort /tmp/base > /tmp/base.out

real    0m18.88s
user    0m13.11s
sys     0m2.91s
# time sort /tmp/base > /tmp/base.out

real    0m18.60s
user    0m13.67s
sys     0m2.55s
# time sort /tmp/base > /tmp/base.out

real    0m18.94s
user    0m13.17s
sys     0m2.94s

THE EXPERIMENTAL TEST
Here is the output from breaking the file to be sorted into smaller
pieces (w/ grep) and then sorting:

# time ./mysort > /tmp/mysort.out  

real    0m16.30s
user    0m12.63s
sys     0m2.98s
# time ./mysort > /tmp/mysort.out

real    0m16.24s
user    0m12.48s
sys     0m3.02s
# time ./mysort > /tmp/mysort.out

real    0m16.11s
user    0m12.58s
sys     0m2.86s

COMPARISON:
# ls -l
-rw-r--r--   1 root     other    10207716 Jun 26 22:56 base
-rw-r--r--   1 root     other    10207716 Jun 26 23:16 base.out
-rw-r--r--   1 root     other    10207716 Jun 26 23:19 mysort.out
# diff base.out mysort.out
#

RESULTS/CONCLUSIONS
The results are consistent with my original hypothesis.  I would
expect an even greater performance boost under a multiprocessor
system, as each process spawned by mysort would be able to run on a
less-busy processor.

TWO-PROCESSOR TEST
Same as above, run on a two-processor box:

# timex sort /tmp/base > /tmp/base.out

real       15.52
user       11.03
sys         1.95

# timex ./mysort > /tmp/mybase.out    

real        7.93
user       10.17
sys         2.66

# ls -al *base.out
-rw-r--r--   1 root     other    10207716 Jun 26 23:58 base.out
-rw-r--r--   1 root     other    10207716 Jun 26 23:59 mybase.out
# diff base.out mybase.out

ADDITIONAL INFO/OBSERVATIONS/SUGGESTIONS/CAVEATS
* The tests were conducted on a generic Sun Ultra 10 unix box, single
processor (the two-processor numbers above are from a separate box).
* The sort algorithm is the 'stock' Solaris 7.0 sort.  More
sophisticated sorts (e.g. GNU sort) may narrow the gap.
* You may be able to tune the performance slightly by adding/removing
'sort engines' (grep/sort pairs) in the script.  This will depend on
your system and your datafile size.
* You should try to balance the intermediate files (g1, g2, etc.) so
that each file is as close as possible to the same size.
* Note that the script was kept as simple as possible to minimize the
effects of starting up sh, grep, etc.
* Please note that all I/O was directed to /tmp.  On Solaris, /tmp is
RAM-backed (for the most part, depending on individual configuration).
* Also, the above is based on certain assumptions about the content of
the data.  mysort would likely have to be tweaked, depending on the
contents of the file to be sorted.
* Note that mysort expects straight ASCII.  This script was NOT
WRITTEN WITH INTERNATIONAL LANGUAGE SUPPORT IN MIND.

Let the buyer beware.  As is. No warranty expressed or implied.
Please test this script thoroughly to make sure it meets your needs.
Your mileage may vary.

Have a nice day,
Dave

----

Here's the mysort script:
#!/bin/sh
# Split the input into three buckets by first letter, greps in parallel.
grep "^[a-h]" /tmp/base > /tmp/base.g1 &
grep "^[i-p]" /tmp/base > /tmp/base.g2 &
grep "^[q-z]" /tmp/base > /tmp/base.g3 &
wait
# Sort the buckets in parallel; each bucket is independent.
sort /tmp/base.g1 > /tmp/base.g1.out &
sort /tmp/base.g2 > /tmp/base.g2.out &
sort /tmp/base.g3 > /tmp/base.g3.out &
wait
# The buckets are disjoint and already ordered relative to each other,
# so concatenating the sorted buckets yields the fully sorted file.
cat /tmp/base.g1.out /tmp/base.g2.out /tmp/base.g3.out
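
To compare it with the baseline, run it the same way as above:

# time ./mysort > /tmp/mysort.out

The a-h / i-p / q-z split is just a guess at producing three buckets
of similar size; per the caveats above, adjust the ranges (and the
number of grep/sort pairs) to balance the g1/g2/g3 files for your
own data.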


 
 
 

unix sort question

Post by Ken Pizzi » Mon, 28 Jun 1999 04:00:00



>Hypothesis:
>If I could grep a large file repeatedly into several smaller files,
>and then sort them asynchronously, I could cat the smaller files back
>(in order) to recreate a final sorted file.  For very large files,
>this approach might be faster because the increased time for greps to
>be completed might be more than offset by the quadratically increasing
>time cost of sorting the single large file.

While partitioning the file in several linear passes may indeed
speed up a sort, it is not the case that the time cost of sorting
is a quadratic function of the input size --- it is an N*log(N)
function.  This still leaves a fair margin for improvement by
special knowledge of the input (e.g., that the first byte of
the sort key is more-or-less equidistributed over some limited
alphabet), but leaves less wiggle room than a quadratic sort
would (and so more care is required to ensure that any preprocessing
doesn't eat more time than is saved).
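
To put rough numbers on it: splitting N items into k equal buckets
cuts the comparison work from about N*log2(N) down to
k * (N/k)*log2(N/k) = N*(log2(N) - log2(k)).  With N = 10 million
lines and k = 3 buckets, log2(N) is about 23.3 and log2(3) is about
1.6, so the partitioning itself saves only about 7% of the
comparisons; most of any measured speedup has to come from elsewhere
(parallel processes, smaller working sets per sort, and so on).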

                --Ken Pizzini

 
 
 

UNIX sort exponents (bc|sort)

Q: Any idea how to UNIX 'sort' the fifth column
   (in the file below) such that the exponent is respected?

csh% sort -r -n +5 foo
     Where file foo contains dummy data:
     a b c d f 2e2
     a b c d f 2e-2
     a b c d f 3e2
     a b c d f 3e-2
This does not respect the exponent.
Note: Column zero is the first column to the 'sort' command.

Any idea what to feed UNIX sort to make it respect the exponent?
    Mostly, I've tried '/bin/sort' and '/bin/bc' (Sun Solaris 7).

thx,
John
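
(A possible lead, assuming GNU sort is available: its -g flag does a
"general numeric" comparison using strtod(), so it understands
exponents, unlike plain -n.  Something like

     sort -r -g -k 6,6 foo

should order the sample lines 3e2, 2e2, 3e-2, 2e-2.  Here -k 6,6 is
the one-based equivalent of the zero-based +5.  Stock Solaris
/bin/sort has no -g option, as far as I know.)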
