Ryan,
This may be of interest to you for your sorting problem. I don't know
how big the file you're trying to sort is, or what assumptions you can
make about the data in it. Below is a test I put together for the
following assumption:
* each line begins with a lower-case letter between a and z. This
can easily be expanded to cover numbers, capitals, punctuation, etc.;
just be sure you understand the collation order of your character set
and locale. A sketch of an expanded split follows below.
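For example, if your data also contained lines starting with digits
and capitals, a hypothetical four-way split might look like the
following (untested sketch; 'yourfile' is just a placeholder, and I'm
assuming plain ASCII collation so the pieces concatenate back in
sorted order):
# LC_ALL=C; export LC_ALL                      # force ASCII collation (assumption)
# grep "^[0-9]" yourfile > yourfile.g0 &       # digits sort before capitals in ASCII
# grep "^[A-Z]" yourfile > yourfile.g1 &       # capitals sort before lower case
# grep "^[a-m]" yourfile > yourfile.g2 &
# grep "^[n-z]" yourfile > yourfile.g3 &
# wait
Each piece then gets sorted (also with LC_ALL=C) and catted back in
g0, g1, g2, g3 order, exactly as the mysort script at the end of this
message does for three pieces.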
ARGUMENTS
* The length of time to perform a sorting task generally grows
faster than linearly with the number of items to be sorted: a naive
comparison sort is roughly second order (quadratic), and even a good
one is on the order of n log n, because each item has to be compared
against others repeatedly. (I'm speaking generally here.)
* The time to perform a search or filter (like grep), however,
generally increases linearly (first order) with the number of items
present, since each line only has to be examined once. A rough
back-of-the-envelope follows below.
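To put some very rough numbers on that (the 200,000-line count below
is just an assumption for illustration, not something I measured): if
a comparison sort does on the order of n * log2(n) comparisons, then
sorting the whole file:      200,000 * log2(200,000)          ~ 200,000 * 17.6 ~ 3.5 million comparisons
sorting three equal pieces:  3 * (200,000/3) * log2(200,000/3) ~ 200,000 * 16.0 ~ 3.2 million comparisons
The total work only drops a little, but the three smaller sorts can
also run at the same time, and that overlap is where most of the
wall-clock savings should come from.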
HYPOTHESIS:
If I grep a large file several times to split it into smaller files
(bucketed by leading character), and then sort those smaller files
asynchronously, I can cat the sorted pieces back together (in order)
to recreate a final sorted file. For very large files, this approach
might be faster because the extra time spent running the greps might
be more than offset by avoiding the faster-than-linear time cost of
sorting the single large file, and because the smaller sorts can run
concurrently.
CREATING A LARGE TEST TEXT FILE:
I put together a little test on my Sun Ultra 10 box. First, I created
a *big* text file, using find:
# find / -name "*" > /tmp/base
Then, I made it bigger by repeatedly prepending a random lower-case
letter (a through z) to every line with the following nawk script:
# nawk ' { n=int(26*rand())+97; printf("%c%s\n", n, $0); } ' base >> base1
{ several times }
# mv /tmp/base1 /tmp/base
# ls -l /tmp
-rw-r--r-- 1 root other 10207716 Jun 26 22:56 base
{other junk}
The file looks like this:
# more base
mlost+found
eusr
husr/lost+found
nusr/platform
xusr/platform/TSBW,8000
eusr/platform/TSBW,Ultra-2i
rusr/platform/sun4u
fusr/platform/sun4u/lib
musr/platform/sun4u/lib/adb
dusr/platform/sun4u/lib/adb/sparcv9
cusr/platform/sun4u/lib/adb/sparcv9/adaptive_mutex
.
.
. (and so on)
Basically, I've created a 10+ MB text file, where I know that each
line begins with a lower-case letter between a and z.
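If you want to verify that assumption on your own data before relying
on it, a quick check (not something I bothered with here) is to
compare the count of matching lines against the total line count:
# grep -c "^[a-z]" /tmp/base
# wc -l < /tmp/base
If the two numbers agree, every line starts with a lower-case letter
and the grep buckets below will not silently drop anything.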
THE BASELINE TEST:
Next, we establish a baseline for the test on my box (YMMV):
# time sort /tmp/base > /tmp/base.out
real 0m18.88s
user 0m13.11s
sys 0m2.91s
# time sort /tmp/base > /tmp/base.out
real 0m18.60s
user 0m13.67s
sys 0m2.55s
# time sort /tmp/base > /tmp/base.out
real 0m18.94s
user 0m13.17s
sys 0m2.94s
THE EXPERIMENTAL TEST
Here is the output from breaking the file into smaller pieces (with
grep) and then sorting those pieces in parallel, using the mysort
script listed at the end of this message:
# time ./mysort > /tmp/mysort.out
real 0m16.30s
user 0m12.63s
sys 0m2.98s
# time ./mysort > /tmp/mysort.out
real 0m16.24s
user 0m12.48s
sys 0m3.02s
# time ./mysort > /tmp/mysort.out
real 0m16.11s
user 0m12.58s
sys 0m2.86s
COMPARISON:
# ls -l
-rw-r--r-- 1 root other 10207716 Jun 26 22:56 base
-rw-r--r-- 1 root other 10207716 Jun 26 23:16 base.out
-rw-r--r-- 1 root other 10207716 Jun 26 23:19 mysort.out
# diff base.out mysort.out
#
RESULTS/CONCLUSIONS
The results are consistent with my original hypothesis. I would
expect an even greater performance boost under a multiprocessor
system, as each process spawned by mysort would be able to run on a
less-busy processor.
TWO-PROCESSOR TEST
Same as above, run on a two-processor box:
# timex sort /tmp/base > /tmp/base.out
real 15.52
user 11.03
sys 1.95
# timex ./mysort > /tmp/mybase.out
real 7.93
user 10.17
sys 2.66
# ls -al *base.out
-rw-r--r-- 1 root other 10207716 Jun 26 23:58 base.out
-rw-r--r-- 1 root other 10207716 Jun 26 23:59 mybase.out
# diff base.out mybase.out
ADDITIONAL INFO/OBSERVATIONS/SUGGESTIONS/CAVEATS
* The tests were conducted on a generic single-processor Sun Ultra 10
Unix box (except for the two-processor run above).
* The sort algorithm is the 'stock' Solaris 7.0 sort. More
sophisticated sorts (e.g. GNU sort) may narrow the gap.
* You may be able to tune the performance slightly by adding or
removing 'sort engines' (grep/sort pairs) in the script. This will
depend on your system and your data file size; see the sketch after
this list.
* You should try to balance the intermediate files (g1, g2, etc.) so
that each file is as close as possible to the same size.
* Note that the script was kept as simple as possible to minimize the
effects of starting up sh, grep, etc.
* Please note that all I/O was directed to /tmp. /tmp on Solaris is a
swap-backed, in-memory filesystem (tmpfs), so it behaves mostly like
RAM (for the most part, depending on the individual configuration).
* Also, the above is based on certain assumptions about the content of
the data. mysort would likely have to be tweaked, depending on the
contents of the file to be sorted.
* Note that mysort expects to be working with straight ASCII. This
script was NOT WRITTEN WITH INTERNATIONAL LANGUAGE SUPPORT IN MIND.
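For example, here is one hypothetical way (untested, same assumptions
as mysort) to look at the bucket balance and run four sort engines
instead of three. The letter-frequency count is handy for picking the
bucket boundaries:
# cut -c1 /tmp/base | sort | uniq -c           # how many lines start with each letter
# grep "^[a-e]" /tmp/base > /tmp/base.g1 &
# grep "^[f-l]" /tmp/base > /tmp/base.g2 &
# grep "^[m-r]" /tmp/base > /tmp/base.g3 &
# grep "^[s-z]" /tmp/base > /tmp/base.g4 &
# wait
# wc -l /tmp/base.g[1234]                      # the closer these counts, the better
Each piece then gets sorted and catted back in g1 through g4 order,
just as mysort does for three pieces.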
Let the buyer beware. As is. No warranty expressed or implied.
Please test this script thoroughly to make sure it meets your needs.
Your mileage may vary.
Have a nice day,
Dave
----
Here's the mysort script:
#!/bin/sh
# Split the input into three pieces by leading letter, running the
# greps in parallel.
grep "^[a-h].*" /tmp/base > /tmp/base.g1 &
grep "^[i-p].*" /tmp/base > /tmp/base.g2 &
grep "^[q-z].*" /tmp/base > /tmp/base.g3 &
wait
# Sort the three pieces in parallel.
sort /tmp/base.g1 > /tmp/base.g1.out &
sort /tmp/base.g2 > /tmp/base.g2.out &
sort /tmp/base.g3 > /tmp/base.g3.out &
wait
# The pieces are already in order relative to each other, so
# concatenating them produces the fully sorted file on stdout.
cat /tmp/base.g1.out /tmp/base.g2.out /tmp/base.g3.out
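If you run the script more than once, note that it leaves its
intermediate files behind in /tmp; you may want to clean them up (or
have the script do it at the end) with something like:
# rm -f /tmp/base.g[123] /tmp/base.g[123].out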