I had an idea a long while ago about how to do possibly faster integer mod

operations for powers of 2 plus or minus 1 (e.g. 3, 7, 9, 15...).

It would be interesting to compare how this would fare with

gcc 2.6x's method for turning a constant division into a

multiplication. Its worst case too is with small moduli.

The first part of the idea is really simple and used to be taught in

school; it is known as casting out nines. Changing to a binary radix,

the rule would be: for powers of two minus 1,

you sum up blocks of bits (size = log (number+1)) in a word.

The second part of the idea comes from the beginning of an old book

``Combinatorial Algorithms,'' by Deo, Nevergelt and Reingold, but

they attribute the idea as going back to the '50s.

Blocks of bits in a word can be summed in parallel in a register using

the usual add and subtract operations provided you make sure that

there is no possibility for a carry to occur between blocks. This is

done by by copying, masking and shifting.

Putting this altogether, here's an example in C

for how x mod 7 would work where x is a 8 byte quantity.

Assume x is in register1 and also is to hold the final result.

r2 = r1;

r1 = r1 & 0xC7; /*1100 0111*/

r2 = r2 & 0x38; /*0011 1000*/

r2 = r2 >> 3;

r1 = r1 + r2;

r2 = r1;

r2 = r2 & 0xC0;

r1 = r1 & 0x3F;

r2 = r2 >> 6;

r1 = r1 + r2;

if (r1 >= 14)

r1 = r1 - 14

else if (r1 >= 7)

r1 = r1 - 7;

For mod 9 the last the r1 = r1 + r2 would be replaced by r1 = r1 - r2

and the "if" would be changed to

if (r1 < -9)

r1 = r1 + 18;

else if (r1 < 0)

r1 = r1 + 9;

Since each statement is roughly an instruction, this would map to

about 16 instructions (3 for the if). On RISC architectures where

there is often a "rotate and mask" and a three register "and"

instruction, the above instruction count would be reduced by maybe 4

instructions in the above example.

If m is log of the nearest power of two of the modulus and n the

number of bits of size of the word result, then the number of

instructions used would be 7 * (log (n - 2m) + 1);

Here, log is the base 2 truncated log. In the above case, n=8 and m=3.

So if the modulus were either 5 or 3 and the result was to be an 8-bit

word, the number of instructions would be about 21 instructions. For

a 32-bit word mod 3 (the toughest case), this would take about 35

instructions.

I often make mistakes in such computations so take the above with a

grain of salt.

--