Quote:
>looking for ideas in an optimal solution toward solving a problem in
>cd-r backups (using linux, initially at least; maybe more o/s' after
>the general algorithm is worked out and implemented).
>
>the idea is to take a large hard disk (or partition) and break it up
>into directories and/or files and having the collection of these
>objects span several cd-r's when making a backup. and minimizing
>wasted trailing space on each cd-r in the saveset.
>
>initially I'm thinking of taking a dir listing and sorting it by file
>size (largest -> smallest). then figuring out how many cd-r's it
>would take to cover that saveset (rough guess). perhaps next I'll
>round-robin save the largest files going from cd-r to cd-r (idea being
>that once you've placed the largest files on each of the cd-r's, the
>smaller ones will be easier to figure out where to place, in the
>spaces on the discs that are unclaimed).
>
>in the end, I'd like to have the smallest # of cd-r's with the least
>amount of empty (wasted) space on each.

This is _really_ _really_ _really_ hard. If you manage to come up with a solution then make sure you publish it.

You are trying to solve the 'Minimum Bin Packing' problem. Here it is more formally:

Given a finite set U of items (the files on the disk in this case), a size for each item (the size of the file), and a positive integer bin capacity B (the capacity of the cd-r media), find the minimum number of bins required to pack all of U.
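As a quick illustration (a sketch with made-up file sizes, not part of any real listing): no packing can use fewer discs than the total size divided by the capacity, rounded up, which gives a handy lower bound to compare any heuristic against.

```python
import math

def lower_bound(sizes, capacity):
    # No packing of these items can use fewer than ceil(sum / capacity) bins.
    return math.ceil(sum(sizes) / capacity)

# File sizes in MB against 650 MB discs (numbers are invented).
print(lower_bound([400, 300, 250, 200, 150, 100, 50], 650))  # -> 3
```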

The very bad news is that this problem is NP-hard, and thus not solvable in reasonable time by any known algorithm. Approximation algorithms exist; I don't know them off the top of my head, but here are some references:

Simchi-Levi, D. (1994), "New worst-case results for the bin-packing problem", Naval Res. Logistics 41, 579-585.

Yue, M. (1991), "A simple proof of the inequality MFFD(L) <= 71/60 OPT(L) + 78/71, for all L, for the MFFD bin-packing algorithm", Technical Report RRR #20-91, Rutcor, Rutgers Center for Operations Research, Rutgers University, New Jersey.

Johnson, D. S., and Garey, M. R. (1985), "A 71/60 theorem for bin-packing", J. Complexity 1, 65-106.

Algorithm textbooks should have simple heuristics that guarantee a reasonable packing. Your proposed solution is one of those, and it is reasonable - especially since cd-r media is cheap anyway.
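One such textbook heuristic, close in spirit to the sort-by-size idea above, is first-fit decreasing: sort the files largest first, then drop each into the first disc with room, opening a new disc when none fits. A minimal sketch (the sizes and 650 MB capacity are invented):

```python
def first_fit_decreasing(sizes, capacity):
    """Pack item sizes into bins using the first-fit decreasing heuristic."""
    bins = []   # each bin is a list of the item sizes placed on it
    free = []   # remaining capacity of each bin
    for size in sorted(sizes, reverse=True):
        for i, room in enumerate(free):
            if size <= room:        # fits in an already-open bin
                bins[i].append(size)
                free[i] -= size
                break
        else:                       # no open bin has room: start a new one
            bins.append([size])
            free.append(capacity - size)
    return bins

# File sizes in MB against 650 MB discs (numbers are invented).
discs = first_fit_decreasing([400, 300, 250, 200, 150, 100, 50], 650)
print(len(discs))  # -> 3
```

Here three discs also matches the trivial lower bound (1450 MB / 650 MB rounds up to 3), so the heuristic happens to be optimal on this input; in general it can use somewhat more bins than optimal.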

Quote:
>for files that are larger than a single cd-r, they will obviously have
>to be split. that's another problem - what size and how many chunks
>to split them into? first idea: take the dumb approach and divide
>them by the media size so that you fill each cd-r with as much of the
>file as possible then span to the next one, etc. but there are
>probably better ideas that would allow better optimization since I
>think the larger the chunks to place, the harder it will be to fill
>all the gaps and leave the least amount of wasted space (just guessing
>- maybe I'm wrong).
>
>I'm not so much concerned with restore or backup speed, just
>minimizing wasted space so that archiving will take the minimal amount
>of physical discs.

If you allow files to be split into pieces then it is trivial to get optimal packing. You just add the files one by one, and when you run out of room, you put the rest of the file onto the next disk. This doesn't affect backing up too much; it makes restoring a bit more tedious if the file you wish to restore happens to be split across media.
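The one-by-one fill just described can be sketched as follows (the file names and sizes are invented; a real tool would also record byte offsets so split files can be reassembled on restore):

```python
def pack_with_splits(files, capacity):
    """Fill discs sequentially, splitting a file whenever it does not fit
    in the space left; every disc except possibly the last ends up full."""
    discs = [[]]       # each disc is a list of (name, chunk_size) pairs
    room = capacity
    for name, size in files:
        while size > 0:
            if room == 0:          # current disc is full: start a new one
                discs.append([])
                room = capacity
            chunk = min(size, room)
            discs[-1].append((name, chunk))
            size -= chunk
            room -= chunk
    return discs

# Sizes in MB against 650 MB discs (numbers are invented); file "a" is
# larger than one disc and gets split across the first two.
layout = pack_with_splits([("a", 900), ("b", 500), ("c", 550)], 650)
print(len(layout))  # -> 3
```

Because every disc but the last is filled exactly, this uses the minimum possible number of discs, matching the "trivial to get optimal packing" claim above.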

Quote:
>so, any ideas? ;-) I've not seen this done before (for freeware, at
>least) and I'd like to make an opensource linux cd-r backup util based
>on this idea. and since linux can read dos/win partitions, its
>conceivable that a bootable linux ramdisk image could be used to
>backup/restore dos/win files..

Backing up win files is fraught with danger due to the stupidity of files with two names, where the shorter names are stored in system registries on occasion.

Quote:
>any ideas on this subject?

Is just writing a tar file to the CD reasonable - with no file-system? I don't know if that would work, but it should give you the maximum amount of space on the cd-r. Plus you then have a tar file to restore from, so existing software should grok it.
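A small sketch of that tar-image idea using Python's tarfile module (all paths here are scratch files made up for the example; the actual burning step, e.g. handing the image to cdrecord as a raw data track, is not shown):

```python
import os
import tarfile
import tempfile

work = tempfile.mkdtemp()
src = os.path.join(work, "src")
os.makedirs(src)
with open(os.path.join(src, "a.txt"), "w") as f:
    f.write("hello")

# Pack the directory into a bare tar image; this single file, written to
# the disc with no ISO9660 filesystem around it, is the whole backup.
image = os.path.join(work, "backup.tar")
with tarfile.open(image, "w") as tar:
    tar.add(src, arcname="src")

# Restore side: whatever is read back off the disc is a plain tar file,
# so any existing tar implementation can unpack it.
restore = os.path.join(work, "restore")
with tarfile.open(image) as tar:
    tar.extractall(restore)

print(open(os.path.join(restore, "src", "a.txt")).read())  # -> hello
```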

Splitting the files amongst the cd-r's is still a problem that has to be solved. But using the largest-first heuristic you mentioned above should give reasonable results, since there are (usually) lots of small files to pad with.

--

Sam

Simple rule: include files should never include include files.

--Rob Pike