what is fastest way to reformat file from variable to fixed length

Post by johnth » Thu, 23 Aug 2001 10:28:10



We need to convert a file from a variable-length delimited format to a
fixed-length format.

Speed is the biggest concern since we have close to 100 gigs of files we want
to change.

For example, we have a 24 gig file in 12 pieces.  Currently the file is
variable length, delimited by |, with fields enclosed in ^
sample:
^717764002^|^71776401.^|^2000-09-11-19.23.00.000000^|^2000-05-25^
^300102^|^30011.^|^2000-06-28-19.57.29.670634^|^2000-05-30^

we need to convert this to a fixed-length format with each field taking a
fixed width.  In addition, the 3rd field needs to be broken into 2 pieces.
sample:
  717764002   71776401.    2000-09-11   -19.23.00.000000        2000-05-25
       300102         30011.   2000-06-28   -19.57.29.670634        2000-05-30

I know this can be done with various tools (C, awk, Perl).  A couple of
questions:

1 - any opinions/experiences on what would be the very fastest method?

2 - any other Unix utility that would be able to do this?

Hardware is a Sun ES10k with 8 CPUs, 800 gigs HD over 36 disks, and 6 gigs
RAM.  We are the only users on the machine.
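
For reference, a one-pass sketch of the conversion in awk.  This is only an
illustration: the field widths in the printf are guesses, and input.txt /
output.txt are placeholder names.

```shell
# Sketch only: strip the ^ enclosures, split field 3 into date and time,
# and print fixed-width columns.  Widths are illustrative guesses.
awk -F'|' '{
    gsub(/\^/, "")              # drop the ^ quotes; awk re-splits the fields
    date = substr($3, 1, 10)    # e.g. "2000-09-11"
    time = substr($3, 11)       # e.g. "-19.23.00.000000"
    printf "%12s%12s%13s%17s%12s\n", $1, $2, date, time, $4
}' input.txt > output.txt
```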

 
 
 

what is fastest way to reformat file from variable to fixed length

Post by Cameron Ke » Thu, 23 Aug 2001 13:53:42


|
| We need to convert a file from a variable-length delimited format to a
| fixed-length format.
|
| Speed is the biggest concern since we have close to 100 gigs of files we want
| to change.
|
|
| For example, we have a 24 gig file in 12 pieces.  Currently the file is
| variable length, delimited by |, with fields enclosed in ^

[...]

| 1 - any opinions/experiences on what would be the very fastest method?
|
| 2 - any other Unix utility that would be able to do this?
|
| Hardware is a Sun ES10k with 8 CPUs, 800 gigs HD over 36 disks, and 6 gigs
| RAM.  We are the only users on the machine.

You have what could be termed an "embarrassingly parallel" problem. You
have 12 separate files, so you could (if possible) delegate this to
multiple computers. In addition, you have 8 CPUs on this machine, so I
would suggest a pipelined approach. Process the file in about 8 sections,
so the computation can be done fully in parallel and all 8 CPUs are
used simultaneously.

Whether you use dedicated Unix tools or C (I would suggest Unix text
processing tools, they're optimised for this stuff), the biggest
optimisation will be in the architecture you choose to solve the problem.
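
The per-piece parallelism could be sketched roughly like this (part*.dat and
convert.awk are placeholder names, not from the thread):

```shell
#!/bin/sh
# Run one conversion per input piece in the background, waiting after
# every 8 launches so roughly 8 jobs run at once (one per CPU).
i=0
for f in part*.dat; do
    awk -f convert.awk "$f" > "${f%.dat}.fixed" &
    i=$((i + 1))
    if [ $((i % 8)) -eq 0 ]; then wait; fi
done
wait    # wait for the remainder
```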

--

Website (recently moved) http://homepages.paradise.net.nz/~cameronk/
--

 
 
 

what is fastest way to reformat file from variable to fixed length

Post by Chuck Dillo » Thu, 23 Aug 2001 23:02:34



> We need to convert a file from a variable-length delimited format to a
> fixed-length format.

> Speed is the biggest concern since we have close to 100 gigs of files we want
> to change.

> For example, we have a 24 gig file in 12 pieces.  Currently the file is
> variable length, delimited by |, with fields enclosed in ^
> sample:
> ^717764002^|^71776401.^|^2000-09-11-19.23.00.000000^|^2000-05-25^
> ^300102^|^30011.^|^2000-06-28-19.57.29.670634^|^2000-05-30^

> we need to convert this to a fixed-length format with each field taking a
> fixed width.  In addition, the 3rd field needs to be broken into 2 pieces.
> sample:
>   717764002   71776401.    2000-09-11   -19.23.00.000000        2000-05-25
>        300102         30011.   2000-06-28   -19.57.29.670634        2000-05-30

> I know this can be done with various tools (C, awk, Perl).  A couple of
> questions:

> 1 - any opinions/experiences on what would be the very fastest method?

> 2 - any other Unix utility that would be able to do this?

> Hardware is a Sun ES10k with 8 CPUs, 800 gigs HD over 36 disks, and 6 gigs
> RAM.  We are the only users on the machine.

It seems to me you are going to be heavily I/O bound.  It probably won't make
any difference which tool you use, assuming you use a tool that can do the
operation in one step, i.e. one I/O cycle.  I would suggest either awk or
Perl.  You could also do it with sed, but in a less elegant way.

I think your strategy should be to distribute jobs across disk drives.  IOW,
run separate jobs for each disk to the extent possible.

HTH,

-- ced

--
Chuck Dillon
Senior Software Engineer
Accelrys Inc., a subsidiary of Pharmacopeia, Inc.

 
 
 

what is fastest way to reformat file from variable to fixed length

Post by Alex Colvi » Fri, 24 Aug 2001 00:20:59


You know, I bet you could have written a Perl script and be done by
now...
 
 
 

what is fastest way to reformat file from variable to fixed length

Post by johnth » Wed, 29 Aug 2001 23:42:15


Perl is way too slow, by an order of magnitude vs. C.  We have to go through
close to 100 gigs of files in less than 24 hours.


>You know, I bet you could have written a Perl script and be done by
>now...

 
 
 

what is fastest way to reformat file from variable to fixed length

Post by RaRa Rasput » Thu, 30 Aug 2001 00:15:30


In the last exciting episode of comp.unix.programmer,
johnthan said:

Quote:

> Perl is way too slow, by an order of magnitude vs. C.  We have to go through
> close to 100 gigs of files in less than 24 hours.

If you've tried it and it was too slow, fair enough.
But 100 Gb in 24 hours sounds incredibly slow from my experience of Perl
regex speed....


>>You know, I bet you could have written a Perl script and be done by
>>now...

--
Rasputin :: Jack of All Trades - Master of Nuns ::
 
 
 

what is fastest way to reformat file from variable to fixed length

Post by Eric Sosma » Thu, 30 Aug 2001 00:29:17



> Perl is way too slow, by an order of magnitude vs. C.  We have to go through
> close to 100 gigs of files in less than 24 hours.

    100e9 bytes in 24 hours is less than 1.2e6 bytes per second.
Your output format looks to be more voluminous than your input,
so figure maybe 1.5 times as much data written as read -- you're
still well below 3 megabytes per second overall.  As long as the
input and output aren't on the same disk, this data rate isn't
strenuous at all.

    I suggest coding up a solution in Perl or whatever takes
your fancy, running it on a subset of your data, and measuring
the achieved throughput.  Don't work harder than you must until
it's actually demonstrated that the extra work is needed.
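
A minimal way to do that measurement (bigfile.dat and convert.pl are
hypothetical names standing in for the real data and whichever script is
being evaluated):

```shell
# Carve off a ~1 GB sample and time the candidate script against it.
# Extrapolate: 100 GB must finish in 24 h, i.e. the 1 GB sample
# should take well under 15 minutes.
head -c 1000000000 bigfile.dat > sample.dat
time perl convert.pl sample.dat > /dev/null
```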

--

 
 
 

what is fastest way to reformat file from variable to fixed length

Post by Joe Schaefe » Thu, 30 Aug 2001 01:07:13



> Perl is way too slow, by an order of magnitude vs. C.  We have to go
> through close to 100 gigs of files in less than 24 hours.

Then try writing a better Perl script. Did your script fork some
children to process the disks in parallel, as was suggested
earlier?  What was your CPU utilization while the script was
running?  What did you use to process each line?

Here are some untested ideas for line processing.  Instead of regexps
you could try using index and substr and see which is faster.

  my $format = "A12A12A12A18A12"; # one A-spec per output field; widths are guesses

  $\ = "\n";

  while (<>) {
    chomp;
    s/(\d+ - \d+ - \d+)/$1^|^/x;                     # break up the date
    print pack $format, map { s/\^//g; $_ } split /\|/;
  }

If you decide to pursue this approach, perhaps crossposting
to clp.misc would be a good idea.

--
Joe Schaefer

 
 
 

what is fastest way to reformat file from variable to fixed length

Post by Alex Colvi » Thu, 30 Aug 2001 02:07:55


How long has this discussion been going on?

Quote:
> > Perl is way too slow, by an order of magnitude vs. C.  We have to go through
> > close to 100 gigs of files in less than 24 hours.

 
 
 
