What's the theoretical max savings on binary diffs?

Post by Jim Kingdon » Wed, 02 Dec 1998 04:00:00



Quote:> 1) What is the THEORETICAL best likely binary diff savings over
> simply saving the whole thing every time? Assume a medium (not more
> than 10% but not trivial either) difference in bytes, but the
> differences are in random places, in a random application domain.

In that situation it seems like the diff could always be around 22.5% of the
size of the file (10% for the differing bytes, plus 12.5% for a bitmap
saying which bytes differ).  Although that may be for a definition of
"differing bytes" which isn't what you have in mind, my intuition is that
for some plausible definition of differing bytes, the result would be in
that ballpark.
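A minimal sketch of that scheme (my illustration, assuming the two files have equal length): one bit per byte says whether it changed, followed by the changed values.

```python
def bitmap_diff(old: bytes, new: bytes) -> tuple[bytes, bytes]:
    """Encode new relative to old (equal lengths assumed):
    a bitmap (1 bit per byte, 12.5% overhead) plus the changed bytes."""
    assert len(old) == len(new)
    bitmap = bytearray((len(new) + 7) // 8)
    changed = bytearray()
    for i, (a, b) in enumerate(zip(old, new)):
        if a != b:
            bitmap[i // 8] |= 1 << (i % 8)
            changed.append(b)
    return bytes(bitmap), bytes(changed)

def bitmap_patch(old: bytes, bitmap: bytes, changed: bytes) -> bytes:
    """Rebuild the new file from the old file plus the delta."""
    out = bytearray(old)
    it = iter(changed)
    for i in range(len(old)):
        if bitmap[i // 8] >> (i % 8) & 1:
            out[i] = next(it)
    return bytes(out)
```

With 10% of the bytes changed, the delta comes to about len/8 + 0.10*len, i.e. roughly 22.5% of the file, matching the estimate above.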

Quote:> I know there are those that have experienced binary diff situations
> where the result files were much LARGER than the original two files!
> However in theory this should not happen, since you could simply
> have a flag included, that if the result is larger, do not use the
> result, simply use the original file plus flag: the file ends up
> only a few bytes larger.

I would think that any good binary diff algorithm would include some
such mechanism (probably per-block rather than per-file), but I don't
know whether the common algorithms like xdelta do or not.
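The per-block fallback suggested above can be sketched like this (function names are hypothetical, and the toy delta coder just lists differing positions):

```python
def naive_delta(old: bytes, new: bytes) -> bytes:
    """Toy delta coder: 2-byte position plus new value per differing byte."""
    out = bytearray()
    for i, (a, b) in enumerate(zip(old, new)):
        if a != b:
            out += i.to_bytes(2, "big") + bytes([b])
    return bytes(out)

def encode_block(old: bytes, new: bytes) -> bytes:
    """Per-block fallback: emit the delta only when it is smaller than
    the raw block; otherwise store the block literally.  A 1-byte flag
    marks which form follows, so the worst case is one extra byte per
    block rather than a blow-up."""
    delta = naive_delta(old, new)
    if len(delta) < len(new):
        return b"\x01" + delta   # flag 1: delta follows
    return b"\x00" + new         # flag 0: literal block
```

The same idea works with any inner delta coder; only the size comparison and the flag matter.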
 
 
 

What's the theoretical max savings on binary diffs?

Post by Eivind Hagen » Thu, 03 Dec 1998 04:00:00


I have some experience with compression algorithms, specifically string
matching algorithms.
It seems that if you treat the original (older) version of the file as a
databank/dictionary to match strings against, you should be able to do
pretty well. If a long string of bytes is the same, just indicate where it
is in the old version instead of copying it. The remaining (different)
bytes would need to be stored anyway, and they represent the actual diff
data.
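This dictionary idea is essentially what copy/insert delta coders do. A rough greedy sketch (illustrative only; real implementations index the old file with a hash table rather than calling find):

```python
def delta_encode(old: bytes, new: bytes, min_match: int = 4):
    """Greedy copy/insert delta: scan new, and wherever a run of at
    least min_match bytes also occurs in old, emit COPY(offset, length)
    instead of the literal bytes."""
    ops, i, lit = [], 0, bytearray()
    while i < len(new):
        probe = new[i:i + min_match]
        off = old.find(probe) if len(probe) == min_match else -1
        if off >= 0:
            # extend the match as far as it goes
            l = min_match
            while (i + l < len(new) and off + l < len(old)
                   and old[off + l] == new[i + l]):
                l += 1
            if lit:
                ops.append(("INSERT", bytes(lit)))
                lit = bytearray()
            ops.append(("COPY", off, l))
            i += l
        else:
            lit.append(new[i])
            i += 1
    if lit:
        ops.append(("INSERT", bytes(lit)))
    return ops

def delta_decode(old: bytes, ops) -> bytes:
    """Replay the op list against the old file to rebuild the new one."""
    out = bytearray()
    for op in ops:
        if op[0] == "COPY":
            out += old[op[1]:op[1] + op[2]]
        else:
            out += op[1]
    return bytes(out)
```

Only the INSERT payloads carry actual diff data; the COPY ops are just (offset, length) pairs into the old version.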

Does anyone know if this has been attempted yet?

-Eivind Hagen, Intermetrics Inc.

 
 
 

What's the theoretical max savings on binary diffs?

Post by DI Michael Schindler » Thu, 03 Dec 1998 04:00:00


hi!

You can do it in about 16% of the file size for 10% random differences
at random places.
You need 10% for the new values (you could squeeze out a tiny amount
more using exclusion) and 5.9% (0.47 bit/byte) for an arithmetic-coded
field flagging changed bytes.
If you code the places of changes directly or with Huffman codes, you
have to act on groups of bytes, having one code for "group unchanged".

There are various possible speed optimizations in exactly how the data
is stored; you should (and easily can) avoid decoding something for
each unchanged byte.
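The 16% figure checks out against Shannon's binary entropy; a quick calculation (assuming 10% of bytes change independently at random):

```python
from math import log2

p = 0.10                                      # fraction of bytes changed
h = -(p * log2(p) + (1 - p) * log2(1 - p))    # binary entropy, bits per flag
flag_cost = h / 8                             # as a fraction of the file size
total = p + flag_cost                         # changed bytes + coded flags
print(f"H(0.1) = {h:.3f} bits, flags = {flag_cost:.1%}, total = {total:.1%}")
```

H(0.1) is about 0.469 bits, so an arithmetic coder spends about 5.9% of the file on the change flags, for a total of roughly 15.9%.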

regards
Michael
--
Michael Schindler   compression consultant




 
 
 

What's the theoretical max savings on binary diffs?

Post by Josh MacDonald » Thu, 03 Dec 1998 04:00:00



> Greetings all:

> Got a two-parter for ya.  I know this has been gone over a few times, but I
> haven't seen it lately, can't remember the answer, and can't find it in dejanews.

> 1) What is the THEORETICAL best likely binary diff savings over simply saving
> the whole thing every time? Assume a medium (not more than 10% but not trivial
> either) difference in bytes, but the differences are in random places, in a
> random application domain.

The phrase "theoretical best likely" doesn't make much sense, and the
answer is that there are two problems to solve, so you can mix and
match your theoretical bounds.  There is a pattern matching problem
and then an encoding problem.  Many of the approximate binary-diff
algorithms I've seen (including Xdelta, which I wrote), limit certain
parameters so that it is impossible to end up with a delta which is
larger than the destination file, except for a header which is
independent of the encoding.  For common encodings, the answer to your
question also depends on the number of fragments present on the
original file.  

Quote:> I know there are those that have experienced binary diff situations where the
> result files were much LARGER than the original two files!  However in theory
> this should not happen, since you could simply have a flag included, that if
> the result is larger, do not use the result, simply use the original file
> plus flag: the file ends up only a few bytes larger.

> 2) That said, are there any good binary diff algorithms or software that do
> not experience the problems of much LARGER result files on occasion?  I would
> prefer software that is free or at least free to try out.

Give XDelta a try, at http://www.xcf.berkeley.edu/~jmacd/xdelta.html.

-josh

 
 
 

What's the theoretical max savings on binary diffs?

Post by Matt Kennel » Sun, 06 Dec 1998 04:00:00


:I have some experience with compression algorithms, specifically string
:matching algorithms.
:It seems that if you treat the original (older version) of the file as a
:databank/dictionary for matching strings with, you should be able to
:do pretty well. If a large string of bytes is the same, then just
:indicate where it is in the old version and don't copy it. The remaining
:(different) bytes would need to be stored anyway, and represents the
:actual diff data.
:
:Does anyone know if this has been attempted yet?

What about

diff --minimal

and passing the remainder through an entropy coder?

:-Eivind Hagen, Intermetrics Inc.

--
*        Matthew B. Kennel/Institute for Nonlinear Science, UCSD          
*
*          "do || !do;  try: Command not found"
*                                         /sbin/yoda --help

 
 
 

What is the theoretical max efficiency for run-length encoding?

I have many files of indefinite length, all of the bytes being composed
of $F's. Looks like run-length encoding is in order. What is the best
theoretical compression I can achieve? Keep in mind that I don't need
to worry about escape characters in the data stream, as the entire
stream is being encoded.

According to IEEE Spectrum, August 93, p. 36, Claude E. Shannon came up
with a nice equation in the 1940's to determine the degree to which a
message can be compressed. Something along the lines of
H = -sum(p * log2(p)) over the symbol probabilities. Does this equation
hold true for files with entropies as low as the ones described above
(or do my files have an entropy of 1)? I guess the real question here
is: does the length of a file have anything to do with the file's entropy?
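As a quick illustration of why such a file compresses almost completely away (a toy encoder for the constant-byte case, not a general RLE): the whole file reduces to one byte value plus a length, so the per-byte entropy tends to zero as the file grows.

```python
def rle_constant(data: bytes) -> bytes:
    """RLE for a file that is one repeated byte: store a 1-byte count
    length, the count itself, and the byte value.  Output size grows
    only with log(len(data)), so per-byte entropy is essentially zero."""
    assert len(set(data)) <= 1
    n = len(data)
    count = n.to_bytes((n.bit_length() + 7) // 8 or 1, "big")
    return bytes([len(count)]) + count + data[:1]

def rle_expand(enc: bytes) -> bytes:
    """Invert rle_constant: read the count, then repeat the byte."""
    k = enc[0]
    n = int.from_bytes(enc[1:1 + k], "big")
    return enc[1 + k:1 + k + 1] * n
```

A 100,000-byte file of $FF bytes encodes to 5 bytes here; the file's length affects the total size only logarithmically, not the per-symbol entropy.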

Any and all help appreciated.
Thanks

Stephen E. Maas
Phone: (612) 457-5478
Snail: 1271 Cherokee Avenue, West Saint Paul, MN 55118-2005
