>> That's the simplest algorithm, but there is a faster one, which I know

>> from Knuth's _Art of Computer Programming, Volume 2: Seminumerical

>> Algorithms_:

>> to multiply 2 2n-bit numbers u and v

...

>Shouldn't that middle term be (U1-U0)(V0-V1)? i.e.

>u * v = 2^2n U1V1 + 2^n (U1V1 + (U1-U0)(V0-V1) + U0V0) + U0V0

>Otherwise the U1V1 and U0V0 in the middle term don't cancel out.

Yes, you're correct. Thanks for noting that.
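For anyone who wants to convince themselves, the corrected identity is easy to check mechanically. A Python sketch (variable names are mine) that exercises it over random operands:

```python
# Check the corrected identity:
#   u*v = 2^(2n) U1V1 + 2^n (U1V1 + (U1-U0)(V0-V1) + U0V0) + U0V0
# for random 2n-bit operands (here n = 8, i.e. 16-bit u and v).

import random

n = 8
mask = (1 << n) - 1

for _ in range(1000):
    u = random.getrandbits(2 * n)
    v = random.getrandbits(2 * n)
    u1, u0 = u >> n, u & mask        # high and low halves of u
    v1, v0 = v >> n, v & mask        # high and low halves of v
    middle = u1 * v1 + (u1 - u0) * (v0 - v1) + u0 * v0
    assert (u1 * v1 << (2 * n)) + (middle << n) + u0 * v0 == u * v
```

Expanding (U1-U0)(V0-V1) gives U1V0 - U1V1 - U0V0 + U0V1, so the U1V1 and U0V0 in the middle term do indeed cancel, leaving the familiar U1V0 + U0V1 cross term.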

>Surely this needs a signed multiply for the (U1-U0)(V0-V1) term, with

>appropriate handling for the subsequent additions.

>I can't see how this would work out more efficiently if the processor

>only has an unsigned multiply instruction.

If the unsigned multiply operation is expensive in terms of

processing time (e.g. the processor has NO multiply instruction at

all), the gain is pretty easy to verify. Basically, the *algorithm* is more

efficient because we're trading one multiplication for a few additions

and subtractions, but the break-even point in the real world, on real

processors, depends on the relative costs of those operations and the

actual size of the multiplicands relative to the "native" size for the

machine. If we're doing multiplies of two 64-bit numbers on a 68HC11,

I think it's clear that it's going to come out ahead of the classic

O(n^2) implementation. If we're just multiplying two 16-bit numbers,

the timing probably depends more heavily on the details of the actual

implementation. As I mentioned, I haven't done the full analysis of

the cycle counts for that processor -- just a "back of the envelope"

calculation that seems to indicate that it's at least worth looking

into.
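To make the multiply-for-adds trade concrete, here is one level of the scheme as a Python sketch (function and variable names are mine): three half-size multiplies plus a handful of additions and subtractions, where the schoolbook method would use four multiplies.

```python
def karatsuba_step(u, v, n):
    """Multiply two 2n-bit unsigned numbers with three n-bit
    multiplies instead of four (one level of the scheme)."""
    mask = (1 << n) - 1
    u1, u0 = u >> n, u & mask
    v1, v0 = v >> n, v & mask
    hi = u1 * v1                     # multiply #1
    lo = u0 * v0                     # multiply #2
    cross = (u1 - u0) * (v0 - v1)    # multiply #3 -- note: signed!
    return (hi << (2 * n)) + ((hi + cross + lo) << n) + lo

# e.g. karatsuba_step(0xDEAD, 0xBEEF, 8) == 0xDEAD * 0xBEEF
```

Applied recursively, this is what drives the asymptotic cost down from O(n^2); applied once, whether it wins depends on exactly the cycle-count questions discussed above.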

>It would be necessary to

>check for negative operands, convert them to positive and calculate the

>sign of the product, correct the result as appropriate, and deal with

>mixed signed and unsigned addition to complete the 2^n term.

The middle term does need a signed multiply, but this is not too

difficult to effect, just as you've mentioned. As for mixed signed

and unsigned addition: it all looks the same if we're dealing with

two's complement numbers. OTOH, if we're dealing with signed

magnitude numbers (like Knuth's pedagogical MIX processor) then it's

trivial to take the absolute value of a number, and we still don't

have any particular problem.
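A tiny illustration of why the adder doesn't care, simulating an 8-bit two's complement word in Python (the word size and helper name are mine):

```python
# In two's complement, signed and unsigned addition produce the
# same bit pattern; only the interpretation of the bits differs.

WORD = 0xFF  # 8-bit machine word

def add8(a, b):
    """One pass through an 8-bit adder, discarding carry-out."""
    return (a + b) & WORD

# -3 has the same 8-bit pattern as unsigned 0xFD (253).
assert add8(0xFD, 10) == 7       # "unsigned" view: (253 + 10) mod 256
assert (-3 + 10) & WORD == 7     # "signed" view: same resulting bits
```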

>Even with a signed multiply instruction, the U1-U0 and V0-V1 terms are

>not representable as an n-bit signed integer: at least one more bit is

>needed for the sign, since the difference terms may range between

>-(2^n - 1) and +(2^n - 1).

Actually, exactly one "scratch" bit is needed, but more may be used for

convenience. First, do the first subtraction and take the absolute

value and store the original sign in the scratch bit location. Do the

second subtraction and take its absolute value, and XOR this sign with

the scratch bit. Do an unsigned multiply and negate the result if the

scratch bit is set. The scratch bit does not have to be initialized

to any particular value, and it doesn't matter which sign is

represented by a 1 as long as both tests are consistent and use the

same representation.
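The scratch-bit procedure, sketched in Python with only an unsigned multiply available (helper name is mine; a real implementation would keep the bit in a flag or register):

```python
def middle_product(u1, u0, v1, v0):
    """Compute (U1-U0)*(V0-V1) using only an unsigned multiply,
    tracking the sign in a single scratch bit."""
    scratch = 0            # initialized only because Python needs a
                           # starting value; any value would do, as
                           # long as both tests below toggle it

    d1 = u1 - u0           # first subtraction
    if d1 < 0:
        d1 = -d1           # take the absolute value...
        scratch ^= 1       # ...and record the sign in the scratch bit

    d2 = v0 - v1           # second subtraction
    if d2 < 0:
        d2 = -d2
        scratch ^= 1       # XOR the second sign into the scratch bit

    p = d1 * d2            # unsigned multiply of the magnitudes
    return -p if scratch else p   # negate iff the signs differed
```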

On machines like the 68HC11, which don't have a bit complement

instruction, it might be handier to burn up a whole byte-wide register

-- in that case, you'd just zero the reg and increment it for each

time you got, say, a negative number. After doing the multiply, if

the low bit of the scratch byte is one, negate the result of the

multiplication. Again, while the algorithm is more efficient, the

details of the particular machine have a great deal to do with whether

it's really more practical in a given situation. Thanks for the great

questions!
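The byte-counter variant amounts to this (again a Python sketch of the idea, not 68HC11 code; the name is mine):

```python
def signed_product_counter(a, b):
    """Same trick, but counting negations in a byte-wide register
    instead of toggling a bit: the low bit of the count is the sign."""
    count = 0                 # zero the scratch register
    if a < 0:
        a, count = -a, count + 1   # increment for each negative operand
    if b < 0:
        b, count = -b, count + 1
    p = a * b                 # unsigned multiply of the magnitudes
    return -p if count & 1 else p  # low bit set -> negate the product
```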

Ed