ECC or non/ECC Memory

Post by Gregory Abb » Wed, 14 Oct 1998 04:00:00



I'm building a 350 MHz Pentium-based Linux box with an Asus P2B
mainboard.  I need to make a decision on whether to use ECC or non/ECC
RAM.  Is there any overhead associated with error correction??  What
are the pros and cons??  I know that ECC will cost more!!
 
 
 

ECC or non/ECC Memory

Post by david kahan » Wed, 14 Oct 1998 04:00:00



> I'm building a 350 MHz Pentium-based Linux box with an Asus P2B
> mainboard.  I need to make a decision on whether to use ECC or non/ECC
> RAM.  Is there any overhead associated with error correction??  What
> are the pros and cons??  I know that ECC will cost more!!

I think that there is some time overhead
for ECC relative to non ECC RAM. The ECC
algorithm takes some time to execute. I don't
think that it amounts to a major overhead in actual
use, maybe only a couple of percent, from an
extra wait state or so per memory read.

There used to be something called `parity' memory.
Probably there still is, though I've heard that
to save a (little) bit of money it's being eliminated
in some mb designs. Anyway I have an old 286 with 640
KB of parity memory. Parity operates just as fast as
non-parity memory. It can detect one bit errors but not
correct them.
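
A minimal Python sketch of why a single parity bit can detect a one-bit
error without being able to locate it, and why a two-bit error goes
unnoticed. The function names and the sample byte are illustrative
assumptions, not a model of any real memory controller:

```python
def parity_bit(byte: int) -> int:
    """Even-parity bit for an 8-bit value: 1 if the number of 1 bits is odd."""
    return bin(byte & 0xFF).count("1") % 2


def parity_ok(byte: int, stored_parity: int) -> bool:
    """True if the stored parity bit still matches the data byte."""
    return parity_bit(byte) == stored_parity


data = 0b10110010
p = parity_bit(data)              # stored alongside the byte on a write

one_flip = data ^ 0b00000100      # one bit flips while stored
two_flips = data ^ 0b00010100     # two bits flip while stored

print(parity_ok(data, p))         # True: nothing changed
print(parity_ok(one_flip, p))     # False: error detected, position unknown
print(parity_ok(two_flips, p))    # True: the double flip is not detected
```

The check only says "odd number of bits changed", which is why a parity
error can be reported but not repaired.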

Therefore if you got a parity error, you were basically
dead. The machine crashed (I think the memory controller
generated a NMI and shut down the processor). Then the
BIOS told you there was a parity error. You could then
reboot and try again, or replace the memory, but that was
about it.

How common are these errors? I don't know in general.
It actually never happened to me on my 286, not even once,
and I used that machine for about three or four years, though
not continuously by any means. But I saw it happen on other
old PC's.

The ECC chipsets for PC's that I have heard of can
correct single bit errors in a 64 bit block of
information, and can detect 2, 3 or 4 bit errors
but can't correct them. A NMI is generated if one
of those comes up, and the processor shuts down.

In any case you usually have to enable ECC in the BIOS
if you have it on the motherboard. On some you can choose
ECC, parity only, or non-parity. The single bit errors
are corrected in ECC mode, and possibly recorded. If there
is any pattern to them, you probably have a hardware problem.
The operating system can keep records of this. Linux
can, I believe, and my motherboard actually keeps some
records in some buffers in the BIOS I think which I can check
on a reboot. I don't actually know how to get Linux to do it though.
Anyone?

I don't think Win95 can do it at all. Maybe NT can but I have
no experience.
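
For readers on much newer kernels than the ones discussed in this thread:
the Linux EDAC subsystem exposes per-memory-controller counts of corrected
and uncorrected errors through sysfs. The sketch below is a rough Python
reader that assumes the /sys/devices/system/edac/mc/mcN/ layout with
ce_count and ue_count files; the function name is made up for
illustration, and the demo runs against a fake directory tree with
invented values so it works on machines without EDAC support:

```python
import tempfile
from pathlib import Path

EDAC_ROOT = "/sys/devices/system/edac/mc"   # assumed location on EDAC-enabled kernels


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: {"corrected": n, "uncorrected": n}}, or {} if absent."""
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue   # counter files missing or unreadable: skip this controller
        counts[mc.name] = {"corrected": ce, "uncorrected": ue}
    return counts


# Demo against a fake tree; the counter values here are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    mc0 = Path(tmp) / "mc0"
    mc0.mkdir()
    (mc0 / "ce_count").write_text("3\n")
    (mc0 / "ue_count").write_text("0\n")
    demo_counts = read_edac_counts(tmp)

print(demo_counts)
```

On a real machine, calling read_edac_counts() with no argument returns an
empty dict when no EDAC driver is loaded for the chipset.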

I have so far never seen any single bit errors on
my Linux box at home. It has ECC memory, is up more
or less continuously, has been running about 1.5 years
and I use it reasonably heavily for numerical calculations.
It's not a server or anything, but it is used to run
CPU/memory intensive programs that take a long time to
finish. That is why I used ECC memory.

I think I probably could have gotten away without using ECC,
but then again, maybe the ECC memory I bought is a bit higher
quality, and that's why I haven't gotten any errors. If I do
in the future, they will presumably be corrected,
too -- I am not looking forward to the day, but I'm not very
worried about it either ;)

In summary, I would say if you are not using your system
in a very critical spot, as a server which really shouldn't
go down much, and which should maybe give you warning if
the memory is about to go bad, then you probably can do
without ECC. Otherwise, it's probably worth it, and you will
probably not care about the extra cost in that situation.

Sorry for the long response.

cheers,

-dave k.

 
 
 

ECC or non/ECC Memory

Post by david kahan » Wed, 14 Oct 1998 04:00:00



> How common are these errors? I don't know in general.

For the record, I found some measurements on the web
relevant to the rate of soft single bit errors expected from
cosmic ray background radiation. See:

 http://net.wpi.edu/ram/ibmnasa/ibm

It has improved over time and is manufacturer dependent.
It seems to go from about 1 error per year in a 256 Kb chip in
1986 to as low as 0.00046 errors per year in a 4Mb chip in 1993.
But there is a range of about a factor of 100 depending on the
manufacturer.
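
To put the per-chip figures in terms of a whole machine, here is a rough
Python sketch. It assumes the error rate simply scales with the number of
chips, ignores the extra parity/ECC chips, and uses a hypothetical 128 MB
system as the example; only the two per-chip rates come from the figures
quoted above:

```python
def errors_per_year(total_megabytes, chip_megabits, chip_errors_per_year):
    """Expected soft errors per year, assuming the rate scales with chip count."""
    chips = total_megabytes * 8 / chip_megabits
    return chips * chip_errors_per_year


SYSTEM_MB = 128   # hypothetical system size

rate_1986 = errors_per_year(SYSTEM_MB, 0.25, 1.0)      # 256Kb chips, 1 error/year each
rate_1993 = errors_per_year(SYSTEM_MB, 4.0, 0.00046)   # 4Mb chips, best 1993 figure

print(f"1986-era chips: {rate_1986:.0f} errors/year")
print(f"1993-era chips: {rate_1993:.3f} errors/year, "
      f"or one every {1 / rate_1993:.1f} years")
```

With the factor of 100 spread between manufacturers, the 1993 result could
be correspondingly worse for other chips.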

I don't know whether cosmic rays are the main source of
errors. Seemingly electronic noise might matter too, and who
knows what else. But these seem to be actual measurements.

I suppose they were taken at sea level ....

NASA also did some research, by directly bombarding the
chips with proton beams:

 http://flick.gsfc.nasa.gov/radhome/papers/d121696a.htm

But they only give you the cross-section per bit, for a
16Mb chip. You will have to work out the error rate given
the cosmic ray flux at your location. I'm leaving that as
an exercise for the reader :)
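
A sketch of how that exercise would go, in Python. The expected upset rate
is the cross-section per bit times the particle flux times the number of
bits. The numeric inputs below are placeholders chosen only to make the
script run; they are NOT taken from the NASA paper or from any
measurement, and the function name is made up for illustration:

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def upsets_per_year(sigma_cm2_per_bit, flux_per_cm2_per_s, n_bits):
    """Expected upsets per year: cross-section x flux x number of bits x time."""
    return sigma_cm2_per_bit * flux_per_cm2_per_s * n_bits * SECONDS_PER_YEAR


# Placeholder inputs (hypothetical, for illustration only).
sigma = 1e-14            # cm^2 per bit
flux = 1e-2              # particles per cm^2 per second at your location
bits = 16 * 2**20        # one 16Mb chip

rate = upsets_per_year(sigma, flux, bits)
print(f"{rate:.3g} upsets per chip-year")
```

Substitute the measured cross-section and a flux figure for your altitude
to get a real estimate.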

cheers,

- dave k.

 
 
 

ECC or non/ECC Memory

Post by RobinHood » Wed, 14 Oct 1998 04:00:00




>I'm building a 350 MHz Pentium-based Linux box with an Asus P2B
>mainboard.  I need to make a decision on whether to use ECC or non/ECC
>RAM.  Is there any overhead associated with error correction??  What
>are the pros and cons??  I know that ECC will cost more!!

You can get ECC memory with 8ns cycle time and 6ns access time; it just
costs more.  I have worked in the UNIX server world for a while now, and
memory does fail.  On OSes other than Linux you can get reports about
your memory.  Every few months we had issues: magnetic disturbance,
temperature fluctuations.  Most errors were logged and corrected by the
ECC memory.  It takes some of the randomness out of your system.  It is
not cost effective unless you really have critical data on your system.

By the same token, I bought a 128MB 8ns/8ns NEC DIMM for my computer.  I
can't stand it when computers act erratically.  My NT machine at my office
is continually having memory troubles, and I believe it is having them
more often than it reports them, which is part of the reason it reboots
and my compiler crashes now and then.... There is also the shitty software
theory... But my old Compaq had different instability problems.

Most memory failures don't crash systems, they just make them act
weird, and they wreak havoc on your compiling.

--
-R*S

 
 
 

ECC or non/ECC Memory

Post by Henrik Carlqvis » Wed, 14 Oct 1998 04:00:00




> > Is there any overhead associated with error correction??

> I think that there is some time overhead
> for ECC relative to non ECC RAM. The ECC
> algorithm takes some time to execute.

As far as I know it is done in hardware so you will not lose any
performance with ECC.

> Parity operates just as fast as non-parity memory. It can detect one
> bit errors but not correct them.

Yes, that is also done in hardware.

> Therefore if you got a parity error, you were basically
> dead. The machine crashed (I think the memory controller
> generated a NMI and shut down the processor).

It generates an NMI; however, it's up to the OS to shut down. Linux
doesn't shut down, it only gives a message in syslog. It might not be a
bad idea to shut down, a bad memory could cause even worse things to
happen.

> How common are these errors?

I have seen parity errors and ECC errors on both PCs and Suns. However,
this is not always because of bad memory. Just as often it has been
because of a bad motherboard or oxide on the SIMMs.

> If there is any pattern to them, you probably have a hardware
> problem. The operating system can keep records of this. Linux
> can, I believe, and my motherboard actually keeps some
> records in some buffers in the BIOS I think which I can check
> on a reboot. I don't actually know how to get Linux to do it though.
> Anyone?

No, I don't know how to make Linux do this.

> I think I probably could have gotten away without using ECC,
> but then again, maybe the ECC memory I bought is a bit higher
> quality, and that's why I haven't gotten any errors. If I do
> in the future, they will presumably be corrected,
> too -- I am not looking forward to the day, but I'm not very
> worried about it either ;)

And best of all, you know that you can trust your memory and your
program results.

regards Henrik
--
spammer strikeback:


 
 
 

ECC or non/ECC Memory

Post by david kahan » Fri, 16 Oct 1998 04:00:00




>> I think that there is some time overhead
>> for ECC relative to non ECC RAM. The ECC
>> algorithm takes some time to execute.

>  As far as I know it is done in hardware so you will not lose any
>  performance with ECC.

Yes, for sure it's done in hardware, otherwise the time cost
would presumably be horrific. But hardware runs at a finite
speed, and I had heard that ECC was a bit slower than a simple
parity check, which is a pretty trivial thing. On a read of
a 64-bit + 8-bit block with ECC, you have to compute some
7-bit checksum for the block and compare it against the stored
7-bit checksum, as well as read out the whole data block. For
parity it's more or less the same, you have to read out and
do a comparison too, but the computation is seemingly
much simpler.

With ECC you know which bit has changed if there is an error,
whereas with parity you don't. There must be some cost for that,
if not in time, then in extra transistors.
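
For what it's worth, the check-bit count can be worked out: a Hamming code
that corrects single errors in m data bits needs r check bits with
2^r >= m + r + 1, which gives r = 7 for m = 64. One more overall parity
bit, for detecting double errors, brings it to 8, matching the 64 + 8 bit
width. The Python sketch below is a textbook extended Hamming (SECDED)
code, assumed here purely for illustration; actual chipsets may use a
different code, and all the names are made up:

```python
from typing import List, Tuple


def check_bits_needed(m: int) -> int:
    """Smallest r with 2**r >= m + r + 1 (single-error correction)."""
    r = 0
    while 2 ** r < m + r + 1:
        r += 1
    return r


def encode(data: List[int]) -> List[int]:
    """Build a codeword; index 0 holds overall parity, powers of two hold check bits."""
    r = check_bits_needed(len(data))
    n = len(data) + r
    word = [0] * (n + 1)
    bits = iter(data)
    for pos in range(1, n + 1):
        if pos & (pos - 1):                  # not a power of two: data position
            word[pos] = next(bits)
    for i in range(r):
        p = 1 << i                           # check bit covering positions with bit i set
        word[p] = sum(word[pos] for pos in range(1, n + 1) if pos & p) % 2
    word[0] = sum(word[1:]) % 2              # overall parity for double-error detection
    return word


def decode(word: List[int]) -> Tuple[str, List[int]]:
    """Return (status, word) where status is ok, corrected or uncorrectable."""
    syndrome = 0
    for pos in range(1, len(word)):
        if word[pos]:
            syndrome ^= pos                  # XOR of set positions is 0 for a valid word
    overall_odd = sum(word) % 2 == 1
    if syndrome == 0 and not overall_odd:
        return "ok", word
    if overall_odd and syndrome < len(word):
        fixed = word.copy()
        fixed[syndrome] ^= 1                 # syndrome is the position of the flipped bit
        return "corrected", fixed
    return "uncorrectable", word             # even number of flips: detected only


data = [int(b) for b in format(0xDEADBEEFCAFEF00D, "064b")]
word = encode(data)
print(len(word))                             # 72 = 64 data + 7 Hamming + 1 overall

one = word.copy()
one[37] ^= 1                                 # single-bit error
status_one, repaired = decode(one)
print(status_one, repaired == word)          # corrected True

two = word.copy()
two[5] ^= 1
two[60] ^= 1                                 # double-bit error
status_two, _ = decode(two)
print(status_two)                            # uncorrectable
```

In hardware the syndrome is a tree of XOR gates rather than a loop, so the
cost shows up as gate delay and transistors rather than as instructions.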

I thought it is similar to the way some complex instructions
(on non-RISC machines) can take more cycles to execute in
the processor than for example a simple integer addition does.

Maybe I'm wrong, though, and it actually is done quickly
enough that the memory doesn't operate any slower. I don't
know for sure. But I seem to remember someone I trusted
to know such things saying it.

Do you know the actual way ECC is done, timings and such?

I really don't, and I'm not trying to be snippety, just
would like to know for sure.

>> Therefore if you got a parity error, you were basically
>> dead. The machine crashed (I think the memory controller
>> generated a NMI and shut down the processor).

>  It generates an NMI; however, it's up to the OS to shut down. Linux
>  doesn't shut down, it only gives a message in syslog. It might not be a
>  bad idea to shut down, a bad memory could cause even worse things to
>  happen.

Yes you are definitely right about that. I was thinking of
the behaviour under DOS when I said that :) And it's true
enough, too, that it could be better to shut down in some cases.

>> I think I probably could have gotten away without using ECC,
>> but then again, maybe the ECC memory I bought is a bit higher
>> quality, and that's why I haven't gotten any errors. If I do
>> in the future, they will presumably be corrected,
>> too -- I am not looking forward to the day, but I'm not very
>> worried about it either ;)

> And best of all, you know that you can trust your memory and your
>  program results.

Absolutely right. For me that is the main advantage.

cheers,

- dave k.

 
 
 

ECC or non/ECC Memory

Post by Henrik Carlqvis » Fri, 16 Oct 1998 04:00:00



> With ECC you know which bit has changed if there is an error,
> whereas with parity you don't. There must be some cost for that,
> if not in time, then in extra transistors.
> Do you know the actual way ECC is done, timings and such?

I would guess that it is all done in hardware without any performance
loss. ECC memory costs some extra and not all motherboards support
ECC, which makes me guess that you will have to pay some extra to get ECC
support. However, there is one way to find out if no one knows. As I have
ECC memory I could try to run a benchmark like lmbench  without parity
check, with parity check and with ECC. Then we could see if there is any
difference.

regards Henrik
--
spammer strikeback:


 
 
 

ECC or non/ECC Memory

Post by Eric Lee Gre » Sat, 17 Oct 1998 04:00:00





>>I'm building a 350 MHz Pentium-based Linux box with an Asus P2B
>>mainboard.  I need to make a decision on whether to use ECC or non/ECC
>>RAM.  Is there any overhead associated with error correction??  What
>>are the pros and cons??  I know that ECC will cost more!!
>ECC memory.  It takes some of the randomness out of your system.  It is
>not cost effective unless you really have critical data on your system.

It's not really that much more expensive nowadays, especially if you're
paying PC-100 rates anyhow.

We made a "command decision" a couple of months ago that we were no longer
going to bother with non-ECC PC-100 memory. The cost difference wasn't
that much, and the gains were too great. We figure that if somebody is paying
the price for a PII-350 and PC-100 memory, chopping $50 off the cost of the
system by using non-ECC memory is false economy.  

--

"To call Microsoft an innovator is like calling the Pope Jewish ..."
            -- James Love (Consumer Project on Technology)

 
 
 

ECC or non/ECC Memory

Post by david kahan » Sat, 17 Oct 1998 04:00:00




> > With ECC you know which bit has changed if there is an error,
> > whereas with parity you don't. There must be some cost for that,
> > if not in time, then in extra transistors.

> > Do you know the actual way ECC is done, timings and such?

> I would guess that it is all done in hardware without any performance
> loss. ECC memory costs some extra and not all motherboards support
> ECC, which makes me guess that you will have to pay some extra to get ECC
> support. However, there is one way to find out if no one knows. As I have
> ECC memory I could try to run a benchmark like lmbench  without parity
> check, with parity check and with ECC. Then we could see if there is any
> difference.

Sounds like the best idea. I will try the same thing too, if
I can manage it. I think that not all ECC memory can be
run as parity memory though, since the extra eight bits
used to store the ECC checksum can't always be read
out individually. The only way to see if it will work is
to try it.

Please let me know how it comes out, and I'll do the
same.

cheers,

- dave k.

 
 
 

ECC or non/ECC Memory

Post by david kahan » Mon, 19 Oct 1998 04:00:00





> > > With ECC you know which bit has changed if there is an error,
> > > whereas with parity you don't. There must be some cost for that,
> > > if not in time, then in extra transistors.

> > > Do you know the actual way ECC is done, timings and such?

> > I would guess that it is all done in hardware without any performance
> > loss. ECC memory costs some extra and not all motherboards support
> > ECC, which makes me guess that you will have to pay some extra to get ECC
> > support. However, there is one way to find out if no one knows. As I have
> > ECC memory I could try to run a benchmark like lmbench  without parity
> > check, with parity check and with ECC. Then we could see if there is any
> > difference.

> Sounds like the best idea. I will try the same thing too, if
> I can manage it. I think that not all ECC memory can be
> run as parity memory though, since the extra eight bits
> used to store the ECC checksum can't always be read
> out individually. The only way to see if it will work is
> to try it.

> Please let me know how it comes out, and I'll do the
> same.

> cheers,

> - dave k.

I downloaded the lmbench package and ran it on my system:
Intel PR440FX, 2 x PPro 200MHz, 8k L1 cache, 256k L2 cache,
cache line size 64 bytes, processors not matched, one has
stepping 07 the other 09. However that is supposed to work
according to Intel ...

My kernel is 2.1.117.

This is a very nice package. I ran it both with ECC enabled in
the BIOS (AMI 1.0.0.8DI0) and with ECC disabled. I have
192 MB of ECC EDO DIMMs installed. I will put the graphs
of memory load latency on my machine at work for download
(not until this evening):

 ftp://bnlnth.phy.bnl.gov/pub/linux/ecc/ecc.ps.gz

for the ECC enabled results, and:

 ftp://bnlnth.phy.bnl.gov/pub/linux/ecc/noecc.ps.gz

for the non-ECC results.

The processor cycles at 5 nanoseconds. L1 cache
has a latency of about 20 nanoseconds, L2 of 30-40,
depending on the stride.

The conclusion: main memory back to back load
latency is 250 nanoseconds with ECC enabled. The
only stride which differs from this value is stride 16,
at 175 nanoseconds.

With ECC disabled the picture is more complicated.
The latency to main memory is dependent on the
stride. The values generally cluster at 200-220
nanoseconds, with the fastest at 175ns (stride 16 again),
and the slowest at 225ns (stride 1028).

So it looks like there is a time cost for the ECC,
for whatever reason. It amounts to some 30 to 50
nanoseconds, or 6 to 10 extra processor cycles.
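
The arithmetic behind that estimate, as a small Python sketch. It uses
only the latencies reported above (250 ns with ECC, a 200-220 ns cluster
without) and the 5 ns cycle time; the function name is made up for
illustration:

```python
def ecc_penalty(ecc_ns, base_ns, cycle_ns):
    """Return (extra ns, extra CPU cycles, percent slower) for one memory load."""
    extra_ns = ecc_ns - base_ns
    return extra_ns, extra_ns / cycle_ns, 100.0 * extra_ns / base_ns


CYCLE_NS = 5.0     # 200 MHz processor
ECC_NS = 250.0     # main memory load latency with ECC enabled

for base in (200.0, 220.0):   # typical range with ECC disabled
    extra, cycles, pct = ecc_penalty(ECC_NS, base, CYCLE_NS)
    print(f"vs {base:.0f} ns: +{extra:.0f} ns, {cycles:.0f} cycles, {pct:.1f}% slower")
```

So the penalty per uncached load is roughly 6 to 10 cycles depending on
which non-ECC figure is taken as the baseline.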

However, on all other system tests, there is no
distinguishable difference between ECC enabled
and ECC disabled. Memory bandwidth,
context switches, everything else looks just the
same. So I don't think it will be a serious cost in
normal use.

One anomaly I noticed, is that the L2 cache, which
is supposed to be 256kB, actually seems to degrade
in performance at an array size of 128kB. I don't know
what the cause could be, but it is strange. Maybe a result
of my mismatched processors??

All the tests were run with the system in single user mode,
with a quiet system, no network connections up. That, I
found out, makes quite a difference to the results.

cheers,

- dave k.

 
 
 
