Combining large files

Post by Salman Mogha » Sun, 29 Jun 2003 16:29:55



I have around 10,000 files of about 1M each that together make up one large
20G tar file.  I would like to combine them all, so I started with a simple
script that does a "cat" on each file and combines it with the next one in
the series.  That process seems to work, but it's EXTREMELY slow.

In DOS, it's possible to copy files as follows, where the bulk of the work is
done by the copy command itself:

copy file1+file2+file3 new_file

There is no need to concatenate the files one pair at a time.  On my board,
concatenating an already-combined 300M file with another 1M file, for example,
takes about 5 minutes, which is not great performance.  So I am wondering if
there is a similar utility out there for Linux that supports the DOS copy
feature.

I am really trying to avoid writing a C program to accomplish this task :)

TIA
Salman

 
 
 

Combining large files

Post by jbuch.. » Sun, 29 Jun 2003 22:58:30



Quote:> I have around 10,000 files of about 1M each that together make up one
> large 20G tar file.  I would like to combine them all, so I started with a
> simple script that does a "cat" on each file and combines it with the next
> one in the series.  That process seems to work, but it's EXTREMELY slow.
> In DOS, it's possible to copy files as follows, where the bulk of the work
> is done by the copy command itself:
> copy file1+file2+file3 new_file

How about:

cat file1 file2 file3 > new_file
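
One thing to watch with ~10,000 pieces: the expanded command line can run past
the shell's argument-length limit.  A workaround sketch (assuming bash, where
printf is a builtin, plain filenames without spaces, and names that glob in
the right order):

printf '%s\n' file* | xargs cat > new_file

The redirection is set up once by the shell, so even if xargs has to run cat
several times, all the output still lands in the same new_file.  Just make
sure new_file itself doesn't match the file* pattern.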

That's a pretty odd "Followup-To:" you have there, the same two groups
you crossposted to. No harm I suppose, but odd.

--

=================== http://www.buchanan1.net/ ==========================
"...if you log off now you may even discover that there is a whole
 world of amazing life outside the Internet. I don't know about that
 myself so I can't describe it to you but I've downloaded pictures of
 it." -Kurt Gray
================= Visit: http://www.thehungersite.com ==================

 
 
 

Combining large files

Post by Shadow_ » Mon, 30 Jun 2003 04:14:24


Quote:> cat file1 file2 file3 > new_file

If they're numbered sequentially, you could get away with:

cat file* >new_file

This assumes the names are zero-padded (01 to 20, NOT 1 to 20), since the
shell expands the wildcard in ASCII (lexicographic) order, at least with the
default C locale, so file10 would otherwise sort before file2.  I join mpg and
other files this way all the time.  This also assumes the new_file name
differs enough from the file* pattern that it doesn't get included in the cat,
and that no other extraneous files get grabbed by the wildcard.

Otherwise, DOS's:
copy file1/b+file2/b+file3/b new_file

under Linux is roughly equivalent to:
cat file1 file2 file3 >new_file

You do NOT need to step it up like this:
cat file1 file2 >new_file1
cat new_file1 file3 >new_file2
cat new_file2 file4 >new_file1
cat new_file1 file5 >new_file
rm new_file1 new_file2
That would be very wasteful and slow.  But I've known a few *s in my
day who enjoyed typing and would do it that way.

One other limitation of sorts: 32-bit systems without large-file support are
likely to cap file size at about 2.1G (2^31 bytes).  So you may only be able
to form your 20G file on an x86-64 or other 64-bit platform, or on a 32-bit
system whose kernel, filesystem, and tools all have large-file support
enabled.
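
A quick way to check what a given filesystem will allow (a sketch, assuming
your getconf supports the POSIX FILESIZEBITS variable) is:

getconf FILESIZEBITS .

A result of 64 means files larger than 2G are possible on that filesystem; 32
means you will hit the ~2.1G ceiling.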

HTH,

Shadow_7

 
 
 

Combining large files

Post by David Utidjia » Mon, 30 Jun 2003 08:37:02


Interesting problem.

Why is it useful for you to have 10,000 1M files rolled into one huge
file?

-DU-...etc...

 
 
 

Combining large files

Post by Salman Mogha » Mon, 30 Jun 2003 10:23:17


I had over 20G of data that I wanted to back up before rebuilding my Linux
machine.  Due to disk space constraints, I tarred it all up and then
compressed it with bzip2, which shrank it down to a file of around 11G.
Before storing the data file, I verified its integrity with the bzip2
utility.  There are more reliable ways to do a backup, but this process
seemed pretty fast and simple.

Everything was fine until I copied the data file back and tried
decompressing it.  bzip2 complained about CRC errors, so I used
bzip2recover to recover the undamaged bzip2 blocks.  Those blocks
(depending on how they were set when the bz2 file was created) are usually
900K chunks.  bzip2recover apparently recovered all ~13,500 blocks and
stored them in 900K-sized bz2-format files.

So in order to get the original 20G tar file back, I started unzipping each
compressed block and then combining the resulting data files together.  That
process is quite lengthy and is taking a long time.
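
I suspect the slow part is that every combine step re-reads everything written
so far.  In hindsight, decompressing each recovered block once and appending
its output to a single file should be much faster; a sketch, assuming the
rec*.bz2 names glob in the right order:

for f in rec*.bz2; do bzip2 -dc "$f"; done > recovered.tar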


Quote:> Interesting problem.

> Why is it useful for you to have 10,000 1M files rolled into one huge
> file?

> -DU-...etc...

 
 
 

Combining large files

Post by David Utidjia » Mon, 30 Jun 2003 13:11:51



> I had over 20G of data that I wanted to back up before rebuilding my
> Linux machine.  Due to disk space constraints, I tarred it all up and
> then compressed it with bzip2, which shrank it down to a file of around
> 11G.  Before storing the data file, I verified its integrity with the
> bzip2 utility.  There are more reliable ways to do a backup, but this
> process seemed pretty fast and simple.

> Everything was fine until I copied the data file back and tried
> decompressing it.  bzip2 complained about CRC errors, so I used
> bzip2recover to recover the undamaged bzip2 blocks.  Those blocks
> (depending on how they were set when the bz2 file was created) are
> usually 900K chunks.  bzip2recover apparently recovered all ~13,500
> blocks and stored them in 900K-sized bz2-format files.

> So in order to get the original 20G tar file back, I started unzipping
> each compressed block and then combining the resulting data files together.
> That process is quite lengthy and is taking a long time.

Hmmmm... I think I see your problem... it basically boils down to not
having enough of the right kind of storage media and/or a solid, tested
plan for using it.

I apologize in advance if some of what follows sounds harsh or uncaring
but when it comes to solid backup plans, as you will learn, there is zero
room for error... and by extension... very little room for kindness and
understanding.

With that said... I think I do understand the position you are in. I hope
for your sake that your livelihood does not depend on the recovery of all
these files.

Are/were all these files located in a single subdirectory off of / ?  Perhaps
/home or /var?  If it was a single subdirectory, was it also a separate
partition?  If you had kept the data in its own subdir on its own partition,
you could have avoided the necessity of moving it off of the current media in
the first place.

What version of bzip2 are you using?  According to the manpage, versions
1.0.1 and earlier have a 512 MByte limit on the files they can handle; that
restriction is removed in version 1.0.2.  Not sure what the max is after
that.

Also according to the manpage, the way to restore the original file after
completing a successful bzip2recover run is to do this:

bzip2 -dc rec*file.bz2 > recovered_data

Does that work? If so then, I guess you can untar the recovered_data
file.
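
If that works, a quick sanity check before trusting the result is to list the
archive and make sure tar gets all the way through it without complaints:

tar tvf recovered_data > /dev/null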

In the future... you should consider a more robust backup plan.  If your
data is valuable to you and/or your employer, then you should consider
getting, at the very least, one or more backup disks so that the data can be
mirrored.  Even better... get a good tape backup system.  I have had very
good luck with DLT tapes and drives.  I have had very bad luck with
DAT/DDS tapes and drives.  A DLT tape system can handle up to 40/80G of
data, and 200+G in the SuperDLT drives.
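
For the cheaper disk-mirror route, a minimal sketch (assuming rsync is
installed and the second disk is mounted at /mnt/backup, a made-up mount
point):

rsync -a --delete /home/ /mnt/backup/home/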

Having a good (and tested) backup plan means never having to say you are
sorry.

-DU-...etc...

 
 
 

Combining large files

Post by Kenneth A Kauffma » Wed, 02 Jul 2003 04:58:03




> [snip]

> In the future... you should consider a more robust backup plan.  If your
> data is valuable to you and/or your employer, then you should consider
> getting, at the very least, one or more backup disks so that the data can
> be mirrored.  Even better... get a good tape backup system.

> Having a good (and tested) backup plan means never having to say you are
> sorry.

> -DU-...etc...

Not everyone here is a corporate admin.  A limited or low-budget home
setup does not warrant the backup methods you describe unless the information
is critical.  This sounds like a home user with "important" information that
needed to be recovered.

Consider that the original question was simply how to combine files faster.

That said, the advice is warranted, if only for perspective.

ken k

 
 
 

combine two large existing files

Hello,

I am looking for example C source code to combine two existing large files
without moving the data from one sector to another, for high performance on
an embedded Linux platform.  I think directly modifying the file system's
tables would be a quick way, but it seems dangerous and difficult.  Any
comments are welcome.

best regards,
Gary
