I had an idea a long while ago about how to do possibly faster integer mod
operations when the modulus is a power of 2 plus or minus 1 (e.g. 3, 5, 7,
9, 15...).
It would be interesting to compare how this would fare with
gcc 2.6x's method for turning a constant division into a
multiplication. Its worst case too is with small moduli.
The first part of the idea is really simple and used to be taught in
school; it is known as casting out nines. Changing to a binary radix,
the rule would be: for a modulus that is a power of two minus 1,
you sum up the blocks of bits (block size = log (modulus+1)) in a word;
the sum is congruent to the original word.
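
As a rough illustration of the block-summing rule (done serially here, one block at a time), the following C sketch reduces x modulo 2^k - 1. The function name mod_pow2_minus1 and its parameters are made up for this example; it is only a sketch under the assumption of a 32-bit unsigned x.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: x mod (2^k - 1), found by summing the k-bit blocks
   of x.  This works because 2^k = 1 (mod 2^k - 1), just as 10 = 1 (mod 9)
   in casting out nines. */
uint32_t mod_pow2_minus1(uint32_t x, unsigned k)
{
    assert(k >= 1 && k <= 31);
    uint32_t m = (1u << k) - 1;      /* the modulus, which is also the block mask */

    while (x > m) {                  /* more than one non-empty block left */
        uint32_t sum = 0;
        while (x != 0) {             /* add up the k-bit blocks */
            sum += x & m;
            x >>= k;
        }
        x = sum;                     /* strictly smaller, same residue */
    }
    return (x == m) ? 0 : x;         /* 2^k - 1 itself is congruent to 0 */
}
```

For instance, mod_pow2_minus1(x, 3) gives x mod 7 and mod_pow2_minus1(x, 2) gives x mod 3.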
The second part of the idea comes from the beginning of an old book
``Combinatorial Algorithms,'' by Deo, Nievergelt and Reingold, but
they attribute the idea as going back to the '50s.
Blocks of bits in a word can be summed in parallel in a register using
the usual add and subtract operations provided you make sure that
there is no possibility for a carry to occur between blocks. This is
done by copying, masking and shifting.
Putting this all together, here's an example in C
for how x mod 7 would work where x is an 8-bit quantity.
Assume x is in r1, which is also to hold the final result.
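
To make the parallel summing concrete, here is a sketch for a 32-bit word mod 3, where the blocks are 2 bits wide. Each line masks alternate fields so that adjacent fields can be added without a carry crossing into a neighbour. The name mod3_u32 and the final folding steps are my own choices for illustration, not taken from the book.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical example: x mod 3 for a 32-bit x, summing blocks in parallel. */
uint32_t mod3_u32(uint32_t x)
{
    /* 4 = 1 (mod 3), so x is congruent to the sum of its sixteen 2-bit blocks. */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);  /* 8 sums, each <= 6  */
    x = (x & 0x0F0F0F0Fu) + ((x >> 4) & 0x0F0F0F0Fu);  /* 4 sums, each <= 12 */
    x = (x & 0x00FF00FFu) + ((x >> 8) & 0x00FF00FFu);  /* 2 sums, each <= 24 */
    x = (x & 0x0000FFFFu) + (x >> 16);                 /* 1 sum,  <= 48      */
    assert(x <= 48);

    /* The total may still exceed 2, so fold it down the same way. */
    x = (x & 0xFu) + (x >> 4);       /* 16 = 1 (mod 3); now <= 17 */
    x = (x & 0x3u) + (x >> 2);       /* now <= 6 */
    if (x >= 6)
        x -= 6;
    else if (x >= 3)
        x -= 3;
    return x;
}
```

The masks leave empty bits above each partial sum, which is what guarantees that no carry can spill from one field into the next.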
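    r2 = r1;
    r1 = r1 & 0xC7;  /* 1100 0111: keep the top and bottom blocks */
    r2 = r2 & 0x38;  /* 0011 1000: the middle block */
    r2 = r2 >> 3;
    r1 = r1 + r2;    /* bottom + middle, at most 14, fits in bits 0-3 */
    r2 = r1;
    r2 = r2 & 0xC0;  /* 1100 0000: the top block */
    r1 = r1 & 0x3F;
    r2 = r2 >> 6;
    r1 = r1 + r2;    /* at most 14 + 3 = 17 */
    if (r1 >= 14)
        r1 = r1 - 14;
    else if (r1 >= 7)
        r1 = r1 - 7;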
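
For anyone who wants to try the mod 7 steps, here they are wrapped in a function, together with a checker that compares the result against the % operator for all 256 inputs. The names mod7_u8 and mod7_u8_check are invented for this sketch.

```c
#include <assert.h>

/* The mod 7 steps for an 8-bit x, wrapped in a function (hypothetical name). */
unsigned mod7_u8(unsigned x)
{
    assert(x <= 0xFF);               /* only 8-bit inputs are handled */
    unsigned r1 = x, r2;

    r2 = (r1 & 0x38) >> 3;           /* middle block */
    r1 = (r1 & 0xC7) + r2;           /* bottom + middle <= 14; top block untouched */
    r2 = (r1 & 0xC0) >> 6;           /* top block */
    r1 = (r1 & 0x3F) + r2;           /* sum of all three blocks, <= 17 */

    if (r1 >= 14)
        r1 -= 14;
    else if (r1 >= 7)
        r1 -= 7;
    return r1;
}

/* Returns 1 if mod7_u8 agrees with % for every 8-bit input, else 0. */
int mod7_u8_check(void)
{
    for (unsigned x = 0; x <= 0xFF; x++)
        if (mod7_u8(x) != x % 7)
            return 0;
    return 1;
}
```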
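For mod 9, 8 is congruent to -1 and 64 to 1, so it is the middle block
that has to be subtracted: the first r1 = r1 + r2 would be replaced by
r1 = r1 - r2, taking care that the borrow does not run into the top block
(e.g. by masking the top block off before subtracting and adding it back
afterwards). The signed result lies between -7 and 10, so the "if" would
be changed to
    if (r1 < 0)
        r1 = r1 + 9;
    else if (r1 >= 9)
        r1 = r1 - 9;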
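
Here is one way the mod 9 case could be worked out for an 8-bit x; treat it as a sketch rather than a tuned instruction sequence. It relies on 8 = -1 (mod 9) and 64 = 1 (mod 9), so the middle 3-bit block is subtracted and the other two are added, in a signed variable so that a negative intermediate is harmless. The name mod9_u8 is made up.

```c
#include <assert.h>

/* Hypothetical example: x mod 9 for an 8-bit x, using alternating signs. */
unsigned mod9_u8(unsigned x)
{
    assert(x <= 0xFF);               /* only 8-bit inputs are handled */
    int b0 = (int)(x & 0x07);        /* bits 0-2, weight  1 =  1 (mod 9) */
    int b1 = (int)((x & 0x38) >> 3); /* bits 3-5, weight  8 = -1 (mod 9) */
    int b2 = (int)(x >> 6);          /* bits 6-7, weight 64 =  1 (mod 9) */

    int r = b0 - b1 + b2;            /* between -7 and 10 */
    if (r < 0)
        r += 9;
    else if (r >= 9)
        r -= 9;
    return (unsigned)r;
}
```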
Since each statement is roughly an instruction, this would map to
about 16 instructions (3 for the if). On RISC architectures where
there is often a "rotate and mask" and a three register "and"
instruction, the above instruction count would be reduced by maybe 4
instructions in the above example.
If m is the log of the power of two nearest the modulus and n is the
number of bits in the word, then the number of
instructions used would be about 7 * (log (n - 2m) + 1).
Here, log is the base 2 truncated log. In the above case, n=8 and m=3.
So if the modulus were either 5 or 3 and the word were 8 bits,
this would come to about 21 instructions. For
a 32-bit word mod 3 (the toughest case), it would take about 35
instructions.
I often make mistakes in such computations so take the above with a
grain of salt.
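
As a quick way to check the arithmetic in the instruction estimate, the formula 7 * (log (n - 2m) + 1) can be written out in C. The function names below are invented, and the figures are only as good as the formula itself.

```c
#include <assert.h>

/* Truncated base-2 log of v (v must be nonzero). */
static unsigned ilog2(unsigned v)
{
    unsigned r = 0;
    while (v >>= 1)
        r++;
    return r;
}

/* Hypothetical helper: estimated instruction count for an n-bit word
   and a modulus near 2^m, using 7 * (log (n - 2m) + 1). */
unsigned est_instructions(unsigned n, unsigned m)
{
    assert(n > 2 * m);               /* the formula needs n - 2m >= 1 */
    return 7 * (ilog2(n - 2 * m) + 1);
}
```

With these definitions, est_instructions(8, 3) gives 14 (close to the roughly 16 counted for the mod 7 example), est_instructions(8, 2) gives 21, and est_instructions(32, 2) gives 35.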