## integer mod with constant modulus.

### integer mod with constant modulus.

I had an idea a long while ago about how to do possibly faster integer mod
operations for powers of 2 plus or minus 1 (e.g. 3, 7, 9, 15...).

It would be interesting to compare how this would fare with
gcc 2.6x's method for turning a constant division into a
multiplication. Its worst case too is with small moduli.

The first part of the idea is really simple and used to be taught in
school; it is known as casting out nines. Changing to a binary radix,
the rule would be: for powers of two minus 1,
you sum up blocks of bits (size = log (number+1)) in a word.

The second part of the idea comes from the beginning of an old book
``Combinatorial Algorithms,'' by Deo, Nevergelt and Reingold, but
they attribute the idea as going back to the '50s.

Blocks of bits in a word can be summed in parallel in a register using
the usual add and subtract operations provided you make sure that
there is no possibility for a carry to occur between blocks. This is
done by by copying, masking and shifting.

Putting this altogether, here's an example in C
for how x mod 7 would work where x is a 8 byte quantity.
Assume x is in register1 and also is to hold the final result.

r2 = r1;
r1 = r1 & 0xC7;   /*1100 0111*/
r2 = r2 & 0x38;   /*0011 1000*/
r2 = r2 >> 3;
r1 = r1 + r2;
r2 = r1;
r2 = r2 & 0xC0;
r1 = r1 & 0x3F;
r2 = r2 >> 6;
r1 = r1 + r2;
if (r1 >= 14)
r1 = r1 - 14
else if (r1 >= 7)
r1 = r1 - 7;

For mod 9 the last the r1 = r1 + r2 would be replaced by r1 = r1 - r2
and the "if" would be changed to
if (r1 < -9)
r1 = r1 + 18;
else if (r1 < 0)
r1 = r1 + 9;

Since each statement is roughly an instruction, this would map to
about 16 instructions (3 for the if). On RISC architectures where
there is often a "rotate and mask" and a three register "and"
instruction, the above instruction count would be reduced by maybe 4
instructions in the above example.

If m is log of the nearest power of two of the modulus and n the
number of bits of size of the word result, then the number of
instructions used would be 7 * (log (n - 2m) + 1);
Here, log is the base 2 truncated log. In the above case, n=8 and m=3.
So if the modulus were either 5 or 3 and the result was to be an 8-bit
word, the number of instructions would be about 21 instructions. For
a 32-bit word mod 3 (the toughest case), this would take about 35
instructions.

I often make mistakes in such computations so take the above with a
grain of salt.

--

### integer mod with constant modulus.

> [Do fast integer mod (2^x +/- 1) computations by casting out nines (or
>  other appropriate factor and then summing blocks of bits in a word in
>  parallel, making sure there is no carry between blocks.]

This is a really neat technique which was described in HAKMEM and is used
in the X11 sources to count the number of bits in a word (which is actually
a modulus operation).  It doesn't require any branches.  On some
architectures the final modulus may be expensive; it can be replaced by a

-Michael Ernst

/*
* This function counts the number of bits in a long.
* It is limited to 63 bit longs, but a minor mod can cope with 511 bits.
*
* It is so magic, an explanation is required:
* Consider a 3 bit number as being
*      4a+2b+c
* if we shift it right 1 bit, we have
*      2a+b
* subtracting this from the original gives
*      2a+b+c
* if we shift the original 2 bits right we get
*      a
* and so with another subtraction we have
*      a+b+c
* which is the number of bits in the original number.
* Suitable masking allows the sums of the octal digits in a
* 32 bit number to appear in each octal digit.  This isn't much help
* unless we can get all of them summed together.
* This can be done by modulo arithmetic (sum the digits in a number by molulo
* the base of the number minus one) the old "casting out nines" trick they
* taught in school before calculators were invented.
* Now, using mod 7 wont help us, because our number will very likely have
* more than 7 bits set.  So add the octal digits together to get base64
* digits, and use modulo 63.
* (Those of you with 64 bit machines need to add 3 octal digits together to
* get base512 digits, and use mod 511.)
*
* This is HAKMEM 169, as used in X11 sources.
*/
static int
t0_hackmemmod(register unsigned long n)
{
register unsigned long tmp;

tmp = n - ((n >> 1) & 033333333333) - ((n >> 2) & 011111111111);
return ((tmp + (tmp >> 3)) & 030707070707) % 63;
}
--

### integer mod with constant modulus.

> The first part of the idea is really simple and used to be taught in
> school; it is known as casting out nines. Changing to a binary radix,
> the rule would be: for powers of two minus 1,
> you sum up blocks of bits (size = log (number+1)) in a word.

Or put slightly more formally:
an              n                    a
m * 2      ==   m * b   mod p      if    2   ==  b   mod p

For the case where b = -1, 0, or 1, this leads to optimizations where
you sum the chunks of bits, with alternating sign (b=-1), all positive
(b=1), or forget all but the least significant chunk (b=0).

If b isn't -1,0 or 1, it is still possible to use this method, but it
becomes uglier:  for example the nearest prime to 2^12 is 4093, making
b = +3.  For a 32 bit value you can use the relation
12                     12
m * 2  + l   ==  3m+l  mod 2  - 3

and simply apply it a few times before the final test (this avoids
having to deal with higher powers of b, which might be more expensive
to multiply by)  Clearly using this method to reduce modulo 23 will
not be very optimal (b = -7 (8 chunks) or +9 (7 chunks)).

In practice there are not many moduli that are amenable to this
method, unless the compiler gets to chose the modulus itself and can
make it close to a power of 2 (!!).

Quote:> Blocks of bits in a word can be summed in parallel in a register using
> the usual add and subtract operations provided you make sure that
> there is no possibility for a carry to occur between blocks. This is
> done by by copying, masking and shifting.

These precautions include taking care if your dividend is negative,
since  2^32 mod p  is unlikely to be zero!

| http://www.harlequin.co.uk/           |   +44 1223 873829                 |
--

### integer mod with constant modulus.

> This is a really neat technique which was described in HAKMEM and is used
> in the X11 sources to count the number of bits in a word (which is actually
> a modulus operation).  It doesn't require any branches.  On some
> architectures the final modulus may be expensive; it can be replaced by a

Actually, I'd be interested if you could name an architecture for
which the final modulus is not expensive.  If I just code up
population count in the ordinary way (see below), I find it is faster
on all the different machine-compiler combiations that I have the
patience to try.

Quote:>  /*
>   * This is HAKMEM 169, as used in X11 sources.
>   */
>  static int
>  t0_hackmemmod(register unsigned long n)
>  {
>     register unsigned long tmp;

>     tmp = n - ((n >> 1) & 033333333333) - ((n >> 2) & 011111111111);
>     return ((tmp + (tmp >> 3)) & 030707070707) % 63;
>  }

time in seconds
hakmem   hakmem'*4  naive   compiler & flags
HP-UX A.09.05 C 9000/710    19.6      8.3       6.9     c89 +O3
IBM Power something         8.2      5.1       4.0     xlc -O3
MicroSparc (50 Mhz)         18.6     10.8       8.8     acc 3.0.1 -fast -cg92 *2
MicroSparc (50 Mhz)         30.7      9.5       9.0     acc 3.0.1 -fast -cg89 *3
MicroSparc (50 Mhz)         29.9      8.9       8.0     gcc 2.5.8 -O3

SuperSparc (50 Mhz)          8.5      5.7       5.2     cc 3.0.1 -fast  -xcg92
SuperSparc (50 Mhz)         21.1      5.7       5.4     cc 3.0.1 -fast  -xcg89

Sparc ELC (Cypress, 33 Mhz)   *1     16.6      15.6     cc 3.0.1 -fast  -xcg92
Sparc ELC (Cypress, 33 Mhz) 52.6     16.7      15.6     cc 3.0.1 -fast  -xcg89
Sparc ELC (Cypress, 33 Mhz) 47.7     10.8       8.3     gcc 2.6.3 -O3
Sparc ELC (Cypress, 33 Mhz) 49.5     13.4      12.3     cc 3.0.1 -fast -xO4 -xcg89

*1 I went to the bathroom and made tea, came back, and it still wasn't done.
The missing instruction is emulated by a trap to the OS.

*2 "-cg89" indicates scheduling and instruction selection tuned to run well
on the SparcStation 2.  In particular, no hardware division.

*3 "-cg92" indicates sched. and inst. selection tuned to run well on the
SuperSparc and MicroSparc chips.  In particular, support for division
in hardware.

*2 & *3 -- as a former Sun employee and compiler hacker, I am very familiar
with the compiler flags and versions.  I am not so familiar with the ins and
outs of IBM and HP compilers.  I also tried various GCC versions to test the
hypothesis that X11 and GCC were tuned for one another -- that is, GCC might
be loaded with optimizations tuned for code like what is in hakmem (and, in
fact, gcc optimizes the mask construction better than cc).

*4 For hakmem-prime, I performed the mod-63 operation with hand-written shifts,
masks, and additions.  I may not have written it optimally, but it appears
to be correct (I tested it separately on a wide range of inputs, and of
course I worked it out first carefully by hand).

So, what do we learn from this:

- if this is really used in X11, it is a mistake.  Why make use of obscure,
slow code when "straightforward" code is ALWAYS faster?  (Yet more fodder
for the X11-haters.)

- it looks like it really would make sense to perform some constant remainders
beat hakmem by a wide margin, and the only difference between the two was the
optimized constant-remainder.

- be wary of untested "great hacks", even when they come from a renowned
Institute of Technology.

Here's the test code:
--------------------------------------------------
#include <stdio.h>

#ifdef NAIVE
int
popcount(unsigned long n)
{
unsigned long x, y;
x = n & 0x55555555;
y = (n >> 1) & 0x55555555;
n = x + y;
x = n & 0x33333333;
y = (n >> 2) & 0x33333333;
n = x + y;
x = n & 0x0f0f0f0f;
y = (n >> 4) & 0x0f0f0f0f;
n = x + y;
x = n & 0x00ff00ff;
y = (n >> 8) & 0x00ff00ff;
n = x + y;
x = n & 0x0000ffff;
y = (n >> 16);
n = x + y;
return n;

Quote:}

#endif
#ifdef HAKMEM
int
popcount(unsigned long n)
{
unsigned long tmp;

tmp = n - ((n >> 1) & 033333333333) - ((n >> 2) & 011111111111);
return ((tmp + (tmp >> 3)) & 030707070707) % 63;

Quote:}

#endif
#ifdef HAKMEM_P
int
popcount(unsigned long n)
{
unsigned long tmp, x, y;

tmp = n - ((n >> 1) & 033333333333) - ((n >> 2) & 011111111111);
tmp = ((tmp + (tmp >> 3)) & 030707070707);

/* Now take remainder mod 63 */
x = tmp & 07700770077;
y = (tmp >> 6) & 07700770077;
tmp = x + y; /* tmp contains 3 sums */
tmp = ((tmp >> 12) + (tmp >> 24) + tmp) & 0777;
tmp = (tmp >> 6) + (tmp & 077);
tmp = tmp + 1;
tmp = (tmp >> 6) + (tmp & 077) - 1;
return tmp;

Quote:}

#endif
main() {
int i;
int j = 0;

for (i = 0; i < 10000000; i++) {
j = j + popcount(i);
}
printf("j = %d\n", j);

Quote:}

--------------------------------------------------

speaking for myself,

David Chase
[I'm pretty sure it was a win on the PDP-6. -John]
--

### integer mod with constant modulus.

> - be wary of untested "great hacks", even when they come from a renowned
>   Institute of Technology.

I think that it would be fairer to let HAKMEM speak for itself.

If anyone is interested, the whole HAKMEM is available online at:
ftp://ftp.netcom.com/pub/hb/hbaker/hakmem/hakmem.html

There is also a pointer to the tiff G4 compressed scanned version
of HAKMEM in the html document.

Below is the actual text of HAKMEM 169.  I can't find a single mention
of C, RISC, SPARC, ALPHA, POWERPC, Pentium, etc.  Go figure.

<h3>ITEM 169 (in order of one-ups-manship: Gosper, Mann, Lenard, [Root and
Mann]):</h3>

To count the ones in a <a href="../pdp-10/pdp-10.html">PDP-6/10</a> word:

<tt><pre>
LDB B,[014300,,A]      ;or MOVE B,A then LSH B,-1
AND B,[333333,,333333]
SUB A,B
LSH B,-1
AND B,[333333,,333333]
SUBB A,B               ;each octal digit is replaced by number of
1's in it
LSH B,-3
AND A,[070707,,070707]
IDIVI A,77             ;casting out 63.'s
</pre></tt>

These ten instructions, with constants extended, would work on word lengths up
to 62.; eleven suffice up to 254..

--
www/ftp directory:
ftp://ftp.netcom.com/pub/hb/hbaker/home.html
[Aha.  It WAS faster on the PDP-6. -John]
--

### integer mod with constant modulus.

| int
| popcount(unsigned long n)
| {
|   unsigned long x, y;
|   x = n & 0x55555555;
|   y = (n >> 1) & 0x55555555;
|   n = x + y;
.....
|   y = (n >> 8) & 0x00ff00ff;
|   n = x + y;
|   x = n & 0x0000ffff;
|   y = (n >> 16);
|   n = x + y;
|   return n;
| }

Yes, this is a pretty fast version.  The big constants hurt a RISC
though, so I removed the masking from bit that weren't used.  This way
of counting takes 16 cycles when used in a loop on a SPARC.

Needless to say that this is even a greater win on a 64 bit
architecture.  This really should be part of any C-programmers bagage,
but browsing fx. the Linux sources, we clearly see it ain't.

#ifdef __GNUC__
inline
#endif
static unsigned
bitcount(unsigned x)
{
/* Clasical way to count set bits in a word, sligtly optimized */
/* This takes 16 SPARC cycles, when done in a loop ;^)         */
/* Assumes a 32-bit architecture                               */

x =     (x >> 1  & 0x55555555) + (x & 0x55555555);
x =    ((x >> 2) & 0x33333333) + (x & 0x33333333);
x =    ((x >> 4) + x) & 0x0f0f0f0f;
x =    ((x >> 8) + x);
return (x + (x >> 16)) & 0xff;

Quote:}

...now back to compiler issues.
--

Hi, everyone

Could you tell me memory section, "bss", "data", or etc., do the static
variables and
constant integers belong to ?