estimating CPU load /MFLOPS for software emulation of floating point

estimating CPU load /MFLOPS for software emulation of floating point

Post by Christopher Holm » Sat, 20 Dec 2003 04:14:42



In an upcoming hardware design I'm thinking about using a CPU without
a floating point unit.  The application uses floating point numbers,
so I'll have to do software emulation.  However, I can't seem to find
any information on how long these operations might take in software.
I'm trying to figure out how much processing power I need & choose an
appropriate CPU.

I have plenty of info on MIPS ratings for the CPUs, and I figured
out how many MFLOPS my application needs, but how do I figure out how
many MIPS it takes to do so many MFLOPS?

Does anyone know of any info resources or methods?

Thanks for any help!
Chris

 
 
 

estimating CPU load /MFLOPS for software emulation of floating point

Post by Nick Maclar » Sat, 20 Dec 2003 04:38:54




>In an upcoming hardware design I'm thinking about using a CPU without
>a floating point unit.  The application uses floating point numbers,
>so I'll have to do software emulation.  However, I can't seem to find
>any information on how long these operations might take in software.
>I'm trying to figure out how much processing power I need & choose an
>appropriate CPU.

>I have plenty of info on MIPS ratings for the CPUs, and I figured
>out how many MFLOPS my application needs, but how do I figure out how
>many MIPS it takes to do so many MFLOPS?

>Does anyone know of any info resources or methods?

Lots of the latter, but the former are mostly in people's heads or
on paper.  Old paper.

If you want to emulate a hardware floating-point format, you are
talking hundreds of instructions or more, depending on how clever
you are and the interface you use.  If you merely want to implement
floating-point in software, then you can get it down to tens of
instructions.  For example, holding floating-point numbers as a
structure designed for software, like:

    struct { unsigned long mantissa; int exponent; unsigned char sign; }

is VASTLY easier than emulating IEEE.  It's still thoroughly messy.

Regards,
Nick Maclaren.

 
 
 

estimating CPU load /MFLOPS for software emulation of floating point

Post by Bob » Sat, 20 Dec 2003 04:53:03



Quote:> In an upcoming hardware design I'm thinking about using a CPU without
> a floating point unit.  The application uses floating point numbers,
> so I'll have to do software emulation.  However, I can't seem to find
> any information on how long these operations might take in software.
> I'm trying to figure out how much processing power I need & choose an
> appropriate CPU.

> I have plenty of info on MIPS ratings for the CPUs, and I figured
> out how many MFLOPS my application needs, but how do I figure out how
> many MIPS it takes to do so many MFLOPS?

> Does anyone know of any info resources or methods?

> Thanks for any help!
> Chris

If you absolutely must use normalized FP (a la IEEE) it could be hundreds
or even thousands of instructions, depending on the CPU resources and the
cleverness of the code.  Look at un-normalized FP or even integer
arithmetic.  Normalization results in non-deterministic timing.  Of course,
if your CPU doesn't have hardware multiply, then all your math timing is
non-deterministic ;-)

Very few things really need FP - the algorithm designers are just lazy. A 32
bit integer has better than 1 ppb (1 part per billion) resolution. Most
things in the real world (like ADCs and DACs) aren't anywhere near that.

Bob

 
 
 

estimating CPU load /MFLOPS for software emulation of floating point

Post by Andrew Reill » Sat, 20 Dec 2003 06:34:59



> If you merely want to implement
> floating-point in software, then you can get it down to tens of
> instructions.  For example, holding floating-point numbers as a
> structure designed for software, like:

>     struct { unsigned long mantissa; int exponent; unsigned char sign; }

> is VASTLY easier than emulating IEEE.  It's still thoroughly messy.

Why would you muck about with a separate sign, rather than just using a
signed mantissa, for a non-standard software implementation?  Does it buy
you something in terms of speed?  Precision, I guess, given that long is
only 32 bits on many systems, and few have 64x64->128 integer multipliers
anyway.  The OP didn't say what the application was, so it's hard to say
whether more than 32 bits of mantissa would be needed.

Frankly, he's almost certainly going to be able to translate to
fixed-point or block-floating-point anyway, and not bother with the
per-value exponent field.  That's what all of the "multi-media"
applications that run on integer-only ARM, MIPS, SH-RISC etc do.  Modern
versions of these chips all have strong (low latency, pipelined) integer
multipliers, so performance can be quite good.

Cheers,

--
Andrew

 
 
 

estimating CPU load /MFLOPS for software emulation of floating point

Post by Paul Keinane » Sat, 20 Dec 2003 07:21:36




Quote:

>If you absolutely must use normalized FP (a la IEEE) it could be hundreds
>or even thousands of instructions, depending on the CPU resources and the
>cleverness of the code.  Look at un-normalized FP or even integer
>arithmetic.  Normalization results in non-deterministic timing.  Of course,
>if your CPU doesn't have hardware multiply, then all your math timing is
>non-deterministic ;-)

Floating point multiplication and division is not much worse than
doing integer multiplication or division with operands of similar
sizes. Only an extra addition/subtraction is involved.

However, floating point addition and subtraction are worse, since you
first have to denormalise the smaller value and then perform the
addition/subtraction in the normal way. Especially after subtraction,
you often have to find the most significant set bit and do the
renormalisation, which can be quite time consuming.

However, even if you would have to normalize a 64 bit mantissa with an
8 bit processor, you could first test in which byte the first "1" bit
is located and by byte copying (or preferably pointer arithmetic) move
that byte to the beginning of the result. After that you have to
perform 1-7 full sized (64 bit) left shift operations (or 1-4 bit
left/right shifts) to get into correct positions. Rounding requires up
to 8 adds with carry.

Even so, I very much doubt that you would require more than 100
instructions in addition to the actual integer multiply/add/sub
operations with the same operand sizes.

An 8 by 8 bit multiply instruction would reduce the computational load
considerably.

Paul

 
 
 

estimating CPU load /MFLOPS for software emulation of floating point

Post by Nick Maclar » Sat, 20 Dec 2003 08:40:00




Quote:

>Why would you muck about with a separate sign, rather than just using a
>signed mantissa, for a non-standard software implementation?  Does it buy
>you something in terms of speed?  Precision, I guess, given that long is
>only 32 bits on many systems, and few have 64x64->128 integer multipliers
>anyway.  The OP didn't say what the application was, so it's hard to say
>whether more than 32 bits of mantissa would be needed.

It buys some convenience, and probably a couple of instructions fewer
for some operations.  Not a big deal.

Quote:>Frankly, he's almost certainly going to be able to translate to
>fixed-point or block-floating-point anyway, and not bother with the
>per-value exponent field.  That's what all of the "multi-media"
>applications that run on integer-only ARM, MIPS, SH-RISC etc do.  Modern
>versions of these chips all have strong (low latency, pipelined) integer
>multipliers, so performance can be quite good.

See "scaling" in any good 1930s book on numerical analysis :-)

Regards,
Nick Maclaren.

 
 
 

estimating CPU load /MFLOPS for software emulation of floating point

Post by glen herrmannsfeld » Sat, 20 Dec 2003 09:31:55


(snip regarding software floating point)

Quote:> Even so, I very much doubt that you would require more than 100
> instructions in addition to the actual integer multiply/add/sub
> operations with the same operand sizes.
> An 8 by 8 bit multiply instruction would reduce the computational load
> considerably.

The 6809 has an 8 by 8 multiply, but the floating point
implementations I knew on it didn't use it.  I looked
at it once, and I don't think it was all that much faster
to use it.

-- glen

 
 
 

estimating CPU load /MFLOPS for software emulation of floating point

Post by CBFalcone » Sat, 20 Dec 2003 14:26:36



> In an upcoming hardware design I'm thinking about using a CPU
> without a floating point unit.  The application uses floating
> point numbers, so I'll have to do software emulation.  However, I
> can't seem to find any information on how long these operations
> might take in software. I'm trying to figure out how much
> processing power I need & choose an appropriate CPU.

There was a time when you had no choice.  You should also decide
on the precision levels needed in the FP system.  Many years ago I
decided that my applications could be adequately handled with a 16
bit significand, and the result was the FP system for the 8080
published in DDJ about 20 years ago.  The actual code is probably
of little use today, but the breakdown may well be.

That was fairly efficient and speedy because the 8080 was capable
of 16 bit arithmetic, and it was not hard to extend it to 24 and
32 bits where needed.

--

   Available for consulting/temporary embedded and systems.
   <http://cbfalconer.home.att.net>  USE worldnet address!

 
 
 

estimating CPU load /MFLOPS for software emulation of floating point

Post by Mike Cowlisha » Sat, 20 Dec 2003 17:50:06



> I have plenty of info on MIPS ratings for the CPUs, and I figured
> out how many MFLOPS my application needs, but how do I figure out how
> many MIPS it takes to do so many MFLOPS?

> Does anyone know of any info resources or methods?

Check out John Hauser's SoftFloat package, at:

  http://www.jhauser.us/arithmetic/SoftFloat.html

He quotes some timings on that page, and/or you could
measure the calculations you are interested in
yourself.

Turning his timings for doubles into number of clock
cycles per operation, one gets roughly:

  Add: 305
  Mul: 285
  Div: 605

On a Pentium, Add and Multiply take 1-3 cycles and
Divide takes 39, so for add or multiply you're
looking at a slowdown of two orders of magnitude;
for divide, nearer to one.

(As others have pointed out, with a non-standard
floating-point format and arithmetic one can go
faster than that.)

Mike Cowlishaw

 
 
 

estimating CPU load /MFLOPS for software emulation of floating point

Post by Everett M. Gree » Sat, 20 Dec 2003 17:53:41


(Nick Maclaren) writes:

> >In an upcoming hardware design I'm thinking about using a CPU without
> >a floating point unit.  The application uses floating point numbers,
> >so I'll have to do software emulation.  However, I can't seem to find
> >any information on how long these operations might take in software.
> >I'm trying to figure out how much processing power I need & choose an
> >appropriate CPU.

> >I have plenty of info on MIPS ratings for the CPUs, and I figured
> >out how many MFLOPS my application needs, but how do I figure out how
> >many MIPS it takes to do so many MFLOPS?

> >Does anyone know of any info resources or methods?

> Lots of the latter, but the former are mostly in people's heads or
> on paper.  Old paper.

> If you want to emulate a hardware floating-point format, you are
> talking hundreds of instructions or more, depending on how clever
> you are and the interface you use.  If you merely want to implement
> floating-point in software, then you can get it down to tens of
> instructions.  For example, holding floating-point numbers as a
> structure designed for software, like:

>     struct { unsigned long mantissa; int exponent; unsigned char sign; }

> is VASTLY easier than emulating IEEE.  It's still thoroughly messy.

And speaking of emulating IEEE 754 float operations, speed and
code size go south in a big hurry if infinities, denormalized
numbers, NaNs, and rounding are handled properly.  Add some
more adverse impact if double-precision float is implemented
instead of or in addition to the usual single-precision float.

Regardless, MFLOPS will be measured in fractions and quite
small fractions at that.  Any relation between MIPS and MFLOPS
will be purely coincidental.

 
 
 

estimating CPU load /MFLOPS for software emulation of floating point

Post by Mike Cowlisha » Sun, 21 Dec 2003 02:42:08



> And speaking of emulating IEEE 754 float operations, speed and
> code size go south in a big hurry if infinities, denormalized
> numbers, NaNs, and rounding are handled properly.

Those are rare cases -- they affect code size, yes, but have only a small
effect on speed.

Quote:> Regardless, MFLOPS will be measured in fractions and quite
> small fractions at that.  Any relation between MIPS and MFLOPS
> will be purely coincidental.

I would expect them to be linearly related.

Mike Cowlishaw

 
 
 

estimating CPU load /MFLOPS for software emulation of floating point

Post by Christopher Holm » Sun, 21 Dec 2003 04:10:22


What is "block floating point"?  



> >Why would you muck about with a separate sign, rather than just using a
> >signed mantissa, for a non-standard software implementation?  Does it buy
> >you something in terms of speed?  Precision, I guess, given that long is
> >only 32 bits on many systems, and few have 64x64->128 integer multipliers
> >anyway.  The OP didn't say what the application was, so it's hard to say
> >whether more than 32 bits of mantissa would be needed.

> It buys some convenience, and probably a couple of instructions fewer
> for some operations.  Not a big deal.

> >Frankly, he's almost certainly going to be able to translate to
> >fixed-point or block-floating-point anyway, and not bother with the
> >per-value exponent field.  That's what all of the "multi-media"
> >applications that run on integer-only ARM, MIPS, SH-RISC etc do.  Modern
> >versions of these chips all have strong (low latency, pipelined) integer
> >multipliers, so performance can be quite good.

> See "scaling" in any good 1930s book on numerical analysis :-)

> Regards,
> Nick Maclaren.

 
 
 

estimating CPU load /MFLOPS for software emulation of floating point

Post by Nick Maclar » Sun, 21 Dec 2003 05:13:20





>> And speaking of emulating IEEE 754 float operations, speed and
>> code size go south in a big hurry if infinities, denormalized
>> numbers, NaNs, and rounding are handled properly.

>Those are rare cases -- affect code size, yes, but only a small effect on
>speed.

Regrettably not :-(

That has been stated for years, but isn't true.  Yes, it is true, if
measured over the space of all applications on all data.  No, it is
not true for all analyses, even excluding perverse and specially
selected ones.  It isn't all that rare to get into a situation where
5-10% of all floating-point calculations are in a problem area (i.e.
underflowing or denormalised), despite the data and results being
well scaled.

Quote:>> Regardless, MFLOPS will be measured in fractions and quite
>> small fractions at that.  Any relation between MIPS and MFLOPS
>> will be purely coincidental.

>I would expect them to be linearly related.

Yes and no.  They are only if the characteristics of the machine
remain constant.  As branch misprediction becomes more serious,
MFLOPS degrades relative to MIPS.

Regards,
Nick Maclaren.

 
 
 

estimating CPU load /MFLOPS for software emulation of floating point

Post by Terje Mathise » Sun, 21 Dec 2003 03:10:22



> However, even if you would have to normalize a 64 bit mantissa with an
> 8 bit processor, you could first test in which byte the first "1" bit
> is located and by byte copying (or preferably pointer arithmetic) move
> that byte to the beginning of the result. After that you have to
> perform 1-7 full sized (64 bit) left shift operations (or 1-4 bit
> left/right shifts) to get into correct positions. Rounding requires up
> to 8 adds with carry.

> Even so, I very much doubt that you would require more than 100
> instructions in addition to the actual integer multiply/add/sub
> operations with the same operand sizes.

I've done something quite similar when I implemented a full 128-bit fp
library, based on 32-bit Pentium asm.

I used a slightly non-standard approach, in that I used a 1:31:96 format
for my numbers, instead of 1:15:112 which is sort-of-standard.

A hw version should at least use a mantissa with more than twice as many
bits as a double, so 107 bits would be the minimum.

Quote:

> An 8 by 8 bit multiply instruction would reduce the computational load
> considerably.

If you don't have even that, but a little room in RAM, then I suggest a
table of squares.

Terje

--

"almost all programming can be viewed as an exercise in caching"

 
 
 

estimating CPU load /MFLOPS for software emulation of floating point

Post by Everett M. Gree » Mon, 22 Dec 2003 01:18:42




> > And speaking of emulating IEEE 754 float operations, speed and
> > code size go south in a big hurry if infinities, denormalized
> > numbers, NaNs, and rounding are handled properly.

> Those are rare cases -- affect code size, yes, but only a small
> effect on speed.

But every operation pays the price of checking for the rare
values whether they occur or not.

Quote:> > Regardless, MFLOPS will be measured in fractions and quite
> > small fractions at that.  Any relation between MIPS and MFLOPS
> > will be purely coincidental.

> I would expect them to be linearly related.

Not across processor families...
 
 
 
