JPEG machine-independent optimization questions for IJG 6a/6b

JPEG machine-independent optimization questions for IJG 6a/6b

Post by Matt Posti » Wed, 07 May 2003 06:16:44



Looking at the 6a source, and also at the 6b, there is one area that
seems
to this newbie user that it could improve, and one which I want to ask
about.

First, In jcphuff.c::encode_mcu_AC_refine there are two loops that run
through the block. The first one is used to modify the coefficients a
little bit and to establish EOB. Then the next loop looks for runs of
zeros and emits ZRLs as appropriate in an inner while loop. It seems
that
these two outer for loops can be combined. The observation is that the
condition on k in

        while (r > 15 && k <= EOB) {

is only useful to avoid emitting the last run of zeros. But by the way
the
second loop is constructed, the last run of zeros is added up (r++
runs a
bunch of times...) and then the loop is exited, so you never actually
use
the k <= EOB condition on the while. This allows the two loops to be
smashed together, the absvalues[] array is eliminated, and, in my test
case, to shave off 10% of the dynamic instructions of the program.

I wondered if any one can comment on the correctness of this
optimization.

Second, why is jpeg_natural_order used in many places instead of
rearranging the order of elements at some other point? The way it is
creates a lot of extra memory loads to establish indexes into
(*block), which always seem to use the jpeg_natural_order index
rearrangement. I wondered if
rearranging once coming in and going out would be better than using
this
technique throughout the center of the algorithm.

Thanks,

Matt
Can you also send any reply to postiffm at umich dot edu?

 
 
 

JPEG machine-independent optimization questions for IJG 6a/6b

Post by Guido Vollbedin » Sun, 11 May 2003 03:48:08



> First, In jcphuff.c::encode_mcu_AC_refine there are two loops that run
> through the block. The first one is used to modify the coefficients a
> little bit and to establish EOB.

The principal purpose of the first loop is to establish the EOB index.
The determination of the absvalues of the coefficients is not so
important here, you can well leave that out and determine the
absvalue of the coefficient "on the fly" in both loops, like
I do in my corresponding arithmetic coding implementation
(see http://sylvana.net/jpeg-ari/uncompressed/jcarith.c ).

Quote:> Then the next loop looks for runs of
> zeros and emits ZRLs as appropriate in an inner while loop. It seems
> that
> these two outer for loops can be combined. The observation is that the
> condition on k in

>         while (r > 15 && k <= EOB) {

> is only useful to avoid emitting the last run of zeros. But by the way
> the
> second loop is constructed, the last run of zeros is added up (r++
> runs a
> bunch of times...) and then the loop is exited, so you never actually
> use
> the k <= EOB condition on the while. This allows the two loops to be
> smashed together, the absvalues[] array is eliminated, and, in my test
> case, to shave off 10% of the dynamic instructions of the program.

I'm afraid you have overlooked something.  Look at the comment below
the "while (r > 15 && k <= EOB)" construct:  "... But if r > 15, we can
only get here if k > EOB, ...".  This suggests that the case "k > EOB"
may occur here, and is necessary to consider for the following correction
bit designation.  This correction bit code is *not* present after loop
exit, therefore it appears necessary to handle inside the loop, with
prior knowledge of the EOB index.

Quote:> I wondered if any one can comment on the correctness of this
> optimization.

I don't think there is reason for noticeable optimization in this part.
The prior loop for EOB determination is very inexpensive.
If you look at my arithmetic code above, you will notice even *two*
prior loops in "encode_mcu_AC_refine":  One to establish actual EOB,
and one to establish previous stage EOB (EOBx).  (But the second loop
starts at actual found EOB position from first loop, so it's still
just one pass really.)
I don't see any performance hog here.

Quote:> Second, why is jpeg_natural_order used in many places instead of
> rearranging the order of elements at some other point? The way it is
> creates a lot of extra memory loads to establish indexes into
> (*block), which always seem to use the jpeg_natural_order index
> rearrangement. I wondered if
> rearranging once coming in and going out would be better than using
> this
> technique throughout the center of the algorithm.

Actually, it *is* in the expected arrangement:  Two-dimensional block
coefficient order in the memory block array, and only rearranged for
going out (in the entropy encoder, sic!) and coming in (entropy decoder).
Since you just look at the entropy encoder, you see the corresponding
rearrangement "in so many cases", but still just in the entropy encoder
(it has so many cases just in the progressive mode).
But this is *not* the "center of the [JPEG] algorithm".  The center
of the JPEG algorithm is elsewhere (in the DCT and quantization modules),
and there you work with the usual 2D block arrangement.
Or look at the lossless transformation code (transupp.c) which operates
directly with DCT coefficient arrays - in usual 2D arrangement, of
course - otherwise it would be even more cumbersome as it actually
is ;).

Regards
Guido

 
 
 

JPEG machine-independent optimization questions for IJG 6a/6b

Post by Matt Postif » Sun, 11 May 2003 05:36:54




> > First, In jcphuff.c::encode_mcu_AC_refine there are two loops that run
> > through the block. The first one is used to modify the coefficients a
> > little bit and to establish EOB.

> The principal purpose of the first loop is to establish the EOB index.
> The determination of the absvalues of the coefficients is not so
> important here, you can well leave that out and determine the
> absvalue of the coefficient "on the fly" in both loops, like
> I do in my corresponding arithmetic coding implementation
> (see http://sylvana.net/jpeg-ari/uncompressed/jcarith.c ).

I see that.

Quote:> > Then the next loop looks for runs of zeros and emits ZRLs as
> > appropriate in an inner while loop. It seems that these two outer for
> > loops can be combined. The observation is that the condition on k in

> >         while (r > 15 && k <= EOB) {

> > is only useful to avoid emitting the last run of zeros. But by the way
> > the second loop is constructed, the last run of zeros is added up (r++
> > runs a bunch of times...) and then the loop is exited, so you never
> > actually use the k <= EOB condition on the while. This allows the two
> > loops to be smashed together, the absvalues[] array is eliminated,
> > and, in my test case, to shave off 10% of the dynamic instructions of
> > the program.

> I'm afraid you have overlooked something.  Look at the comment below
> the "while (r > 15 && k <= EOB)" construct:  "... But if r > 15, we can
> only get here if k > EOB, ...".  This suggests that the case "k > EOB"
> may occur here, and is necessary to consider for the following correction
> bit designation.  This correction bit code is *not* present after loop
> exit, therefore it appears necessary to handle inside the loop, with
> prior knowledge of the EOB index.

Thanks for your response. Maybe you'll have to explain this from a
different angle for me to "get it." I have three responses to your answer.
First, k > EOB is different than checking k <= EOB in the while loop
control. In the comment you cite, the portion you quoted is only used to
prove that checking temp > 1 is sufficient, not to prove that you need to
check k against EOB. Perhaps the comment needs adjustment? It should read
something like this (addition in caps):

     * that we also need to test r > 15. But if r > 15, we can only get here
     * if k > EOB, which implies that this coefficient is not 1.
     * IN FACT, WE CANNOT REACH HERE IF k > EOB BECAUSE OF THE CONTINUE
     * STATEMENT ABOVE.

Second, it seems to me that the condition k > EOB is equivalent to "we are
in the last run of all zeros"  and therefore the final zero encoding will
cover all those--none of the other correction code would ever run. In
other words, every time you reach the while loop, k <= EOB, for if k >
EOB, everything beyond that point is all zeros, the r++ would run a bunch
of times, and then you would exit the for loop entirely, never reaching
the while loop. Looking at the edge case, if k == EOB and you have a
nonzero case, you will fall through to the while, execute it if r > 15,
and then accumulate all the final zeros and not touch the while again.

Third, the comment above the first loop says "It is convenient to make a
pre-pass..." I assume "convenient" does not mean "necessary", so I was
looking for a way to optimize the loops together, assuming that it was
indeed possible.

Quote:> > I wondered if any one can comment on the correctness of this
> > optimization.

> I don't think there is reason for noticeable optimization in this part.
> The prior loop for EOB determination is very inexpensive.
> If you look at my arithmetic code above, you will notice even *two*
> prior loops in "encode_mcu_AC_refine":  One to establish actual EOB,
> and one to establish previous stage EOB (EOBx).  (But the second loop
> starts at actual found EOB position from first loop, so it's still
> just one pass really.)
> I don't see any performance hog here.

I think it _is_ quite a source of slowdown--The first loop and the top of
the second loop (through the continue statement) account for 20% of the
instructions run for a complete encode of a small image. This is because
the code walks through every entry in presumably every block--and when you
have two loops to do it instead of merging it into one, you have to touch
memory twice as many times--loads and stores are expensive. Perhaps you
can run a profile on the code and see if the most concentrated part of the
execution falls in this same function, or somewhere else.

- Show quoted text -

Quote:> > Second, why is jpeg_natural_order used in many places instead of
> > rearranging the order of elements at some other point? The way it is
> > creates a lot of extra memory loads to establish indexes into
> > (*block), which always seem to use the jpeg_natural_order index
> > rearrangement. I wondered if rearranging once coming in and going out
> > would be better than using this technique throughout the center of the
> > algorithm.

> Actually, it *is* in the expected arrangement:  Two-dimensional block
> coefficient order in the memory block array, and only rearranged for
> going out (in the entropy encoder, sic!) and coming in (entropy decoder).
> Since you just look at the entropy encoder, you see the corresponding
> rearrangement "in so many cases", but still just in the entropy encoder
> (it has so many cases just in the progressive mode).
> But this is *not* the "center of the [JPEG] algorithm".  The center
> of the JPEG algorithm is elsewhere (in the DCT and quantization modules),
> and there you work with the usual 2D block arrangement.
> Or look at the lossless transformation code (transupp.c) which operates
> directly with DCT coefficient arrays - in usual 2D arrangement, of
> course - otherwise it would be even more cumbersome as it actually
> is ;).

Yes. Thanks,

Matt

- Show quoted text -

Quote:> Regards
> Guido

 
 
 

JPEG machine-independent optimization questions for IJG 6a/6b

Post by Guido Vollbedin » Mon, 12 May 2003 00:36:53



> Thanks for your response. Maybe you'll have to explain this from a
> different angle for me to "get it."

OK, here is an example:
Assume that zz[k=20] = 1; (zz = zigzag coeff)
            zz[k=21...39] = 0;
            zz[k=40] = 2;
            zz[k=41...63] = 0;
            Al = 0;
            Se = 63 (full AC band).
Now note that EOB = 20, *not* 40, since EOB is defined here as last
*newly nonzero* coefficient.  zz[40] = 2 = binary_10 is *previously*
nonzero, thus doesn't count as EOB.
If we now come to k=40 in the loop, we have r=19 > 15 *and* k=40 >
EOB (=20).  Thus this is *not* the last run of zeros, we don't
escape the loop, and we still have to code the correction bit for
zz[40] = 2.  But we avoid emitting a ZRL by the while construct,
since we are *after* EOB per definition and it thus can be folded
into EOBRUN.
If you would leave off the k <= EOB condition in the while, you
would emit an obsolete ZRL.
It appears that your confusion comes from the different EOB definition
in the progressive mode:  It is here not just the last nonzero
coefficient (after point transform by Al, as counted with zero run r),
but the last *newly* nonzero coefficient (absvalue=1 after point
transform by Al).  Thus there's not the straight relationship between
EOB and zero run r as in the sequential mode.

Quote:> I think it _is_ quite a source of slowdown--The first loop and the top of
> the second loop (through the continue statement) account for 20% of the
> instructions run for a complete encode of a small image. This is because
> the code walks through every entry in presumably every block--and when you
> have two loops to do it instead of merging it into one, you have to touch
> memory twice as many times--loads and stores are expensive. Perhaps you
> can run a profile on the code and see if the most concentrated part of the
> execution falls in this same function, or somewhere else.

Well, you can modify the code somewhat according to my arithmetic coding
example:
1. Remove the absvalues array and calculate the absvalues on the fly.
2. Step through the array backwards from the end to find the EOB:
   for (EOB = cinfo->Se; EOB > 0; EOB--) {
     temp = (*block)[jpeg_natural_order[EOB]];
     if (temp < 0)
       temp = -temp;            /* temp is abs value of input */
     temp >>= Al;         /* apply the point transform */
     if (temp == 1) break;      /* EOB = index of last newly-nonzero coef */
   }
This way you avoid stepping through the whole array twice.
You could possibly furthermore reduce the array accesses if you establish
an additional 'conventional' EOB index cEOB:
2a.
   for (cEOB = cinfo->Se; cEOB > 0; cEOB--) {
     temp = (*block)[jpeg_natural_order[cEOB]];
     if (temp < 0)
       temp = -temp;            /* temp is abs value of input */
     temp >>= Al;         /* apply the point transform */
     if (temp) break;           /* cEOB = index of last nonzero coef */
   }
   for (EOB = cEOB; EOB > 0; EOB--) {
     temp = (*block)[jpeg_natural_order[EOB]];
     if (temp < 0)
       temp = -temp;            /* temp is abs value of input */
     temp >>= Al;         /* apply the point transform */
     if (temp == 1) break;      /* EOB = index of last newly-nonzero coef */
   }
Now use cEOB as main loop limit instead of Se, thus saving
to step through the last run of zeros again.
The condition after loop exit
  if (r > 0 || BR > 0) {  /* If there are trailing zeroes, */
has then also to be changed to
  if (cEOB < Se || BR > 0) {      /* If there are trailing zeroes, */
I haven't actually tried that further modification and am not quite
sure about it, just an idea (cEOB would be 40 in the above example).

Regards
Guido

 
 
 

1. Fast JPEG input using IJG 6a

i'm trying to read a JPG file in using IJG 6a. i can get it to work just
fine, if i use a sample output buffer of 1 scanline, but when i try to do
this :

        JSAMPARRAY buffer = (*cinfo.mem->alloc_sarray)
                ((j_common_ptr) &cinfo, JPOOL_IMAGE, row_stride,
                cinfo.output_height);

        int iLinesProcessed = jpeg_read_scanlines(&cinfo, buffer,
cinfo.output_height);

iLinesProcessed is always... 2 . ??

        i am trying to read the whole image in one gulp (attemping to speed things
up). but, jpeg_read_scanlines always returns 2. ignoring the return value
and processing normally, jpeg_finish_decompress tells me that the
application has not transferred enough scanlines.

        does anyone know if this technique (if it's even possible to implement)
will result in a speedup?

        -c

2. tool bar lost

3. IJG JPEG Library question...

4. dpi?

5. Using The Independent JPEG Group's JPEG software

6. Algorithms for Region grow

7. Benchmarking and optimization of machines ...

8. Creating drop shadows

9. IJG/JPEG/DCT help needed

10. IJG Jpeg Library: rdgif?

11. IJG Jpeg - overriding output_message?

12. IJG - jpeg on Windows

13. The IJG source for JPEG and IBM Visual Age C++ compiler