Hypothetical situation - system stability - feedback requested.

Post by Louis W » Sat, 31 Jul 1999 04:00:00



Hi all,

Here's a hypothetical situation extrapolated from a real-life event I would
like some feedback on:

Two parties are in dispute over the allocation of IPC keys for shared
memory, message queues and semaphores (shmget, msgget, semget) on a mission
critical system currently under development for a customer.

Party one holds the view that using fixed key values is safe due to there
being a "2^32 to 1" chance of the same key being (randomly) selected by
another process. The current implementation uses this mechanism.

Party two holds the view that the keys should be allocated by using the
system call ftok() (or any similar API) to ensure that the key is not in use
by any other process at the time of the call, reducing the possibility of a
collision to an absolute minimum, but accepts that the alteration to the
code would require approximately 30 man-hours for change and additional
testing.
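
For anyone weighing the two positions, a minimal sketch in C of what each mechanism actually does may help. As far as I know, ftok() is a library routine that only derives a key from a path and a project id; it does not check whether that key is already in use. It is the IPC_CREAT | IPC_EXCL flags on shmget() that make a clash visible, and that applies equally to a fixed key. The helper names key_from_path and create_exclusive, the 0600 mode, and the choice of shared memory rather than queues or semaphores are my own assumptions for illustration, not the system's code.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Hypothetical helper: derive a key from an existing file plus a project id.
 * ftok() only computes a number; it does not check whether the key is in use. */
key_t key_from_path(const char *path, int proj_id)
{
    return ftok(path, proj_id);   /* (key_t)-1 on failure, e.g. path missing */
}

/* Hypothetical helper: create a brand-new segment or fail.
 * Returns the shmid, or -1 with errno == EEXIST when the key is already taken. */
int create_exclusive(key_t key, size_t size)
{
    return shmget(key, size, IPC_CREAT | IPC_EXCL | 0600);
}
```

Calling create_exclusive() twice with the same key should succeed the first time and fail with EEXIST the second time, whichever way the key was chosen.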

Party one claims, to the best of their knowledge, that such a key collision
has not occurred in the past seven years in the system's predecessors which
have used the fixed-key mechanism. However, no proof can be supplied.

Party two claims they can produce evidence, by way of testimony from other
developers, indicating that the collision may have occurred on several
occasions over the last month during unit testing. Physical proof is
currently unavailable due to such faults not being logged.
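
Since the faults were not being logged, here is a rough sketch of how a clash could be recorded at initialisation. It looks up an existing segment for the key and reports its creator via shmctl() with IPC_STAT. The function name report_key_owner and the log format are hypothetical, and it covers shared memory only; message queues and semaphores would need the msgctl()/semctl() equivalents.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Hypothetical helper: if `key` already names a segment, log who created it.
 * Returns 1 if an existing segment was found and reported, 0 if the key is
 * free, and -1 if the lookup itself failed (e.g. no permission). */
int report_key_owner(key_t key, FILE *log)
{
    struct shmid_ds info;
    int id = shmget(key, 0, 0);          /* look up only; never creates */

    if (id == -1)
        return errno == ENOENT ? 0 : -1;
    if (shmctl(id, IPC_STAT, &info) == -1)
        return -1;

    fprintf(log, "key 0x%lx in use: shmid=%d creator pid=%ld uid=%ld size=%lu\n",
            (unsigned long)key, id, (long)info.shm_cpid,
            (long)info.shm_perm.uid, (unsigned long)info.shm_segsz);
    return 1;
}
```

Called just before the system creates its own segment, this would leave a record of which process held the key, which is the evidence neither party currently has.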

Points to note:
1> The system is designed to run 24/7 but may be reinitialised (i.e. cold
started) once a quarter in most instances and once a week in others.
2> The collision would only occur during initialisation of the system.
3> The collision will cause the system to fail, resulting in an unstable
situation where the system is unusable until human intervention alters the
fixed key and reinitialises the system and optionally restarts the process
with which the collision occurred (which may come from a third party vendor
and be unrelated to the system).
4> Should the system fail during a critical period, occurring once a
calendar month for a period of up to 48 hours, the loss of service to the
customer would have a significant, albeit brief, impact.
5> The manual identification and recovery process takes roughly 10 minutes -
once a technician becomes available.
6> The customer is a major multinational and will be using the system across
dozens of platforms in a number of countries.
7> The work, should it be authorised, would be assigned to party one, the
party that wrote the fixed-key implementation.

Which party do you feel has the stronger case?

I await responses with interest ;-)

Louis

 
 
 

Post by F.R.M.Barn » Sat, 31 Jul 1999 04:00:00


: Two parties are in dispute over the allocation of IPC keys for shared
: memory, message queues and semaphores (shmget, msgget, semget) on a mission
: critical system currently under development for a customer.

: Party one holds the view that using fixed key values is safe due to there
: being a "2^32 to 1" chance of the same key being (randomly) selected by
: another process. The current implementation uses this mechanism.

: Party two holds the view that the keys should be allocated by using the
: system call ftok() (or any similar API) to ensure that the key is not in use
: by any other process at the time of the call, reducing the possibility of a
: collision to an absolute minimum, but accepts that the alteration to the
: code would require approximately 30 man-hours for change and additional
: testing.

: Party one claims, to the best of their knowledge, that such a key collision
: has not occurred in the past seven years in the system's predecessors which
: have used the fixed-key mechanism. However, no proof can be supplied.

: Party two claims they can produce evidence, by way of testimony from other
: developers, indicating that the collision may have occurred on several
: occasions over the last month during unit testing. Physical proof is
: currently unavailable due to such faults not being logged.

Party two.  The current implementation is incorrect: although the chances
of a collision are minimal, it may happen.  Spending 30 man-hours to
fix the code is worth it.  If a mission critical system contains bugs,
then it is a serious problem.

Fred.
--
+----------------------------------------------------------------------+
| Fred Barnes, UKC                            http://teddy.xylene.com/ |

+----------------------------------------------------------------------+

 
 
 

Post by Harald Kirs » Sat, 31 Jul 1999 04:00:00



> Hi all,

> Here's a hypothetical situation extrapolated from a real-life event I would
> like some feedback on:

[story deleted]

Nice story. If you could prove the situation to be exactly as you
described it, I would not recode. Let it happen, and gamble on the
cost of sending the technician --- as long as there is no human life
endangered.

However: are you really sure that you know the whole story about the
key collision? According to Murphy, the key collision will strike at
a moment and under circumstances where you really don't need it, and
where it might obscure more severe errors. While this might sound a
bit hypothetical, I think it is the way things go.

Additionally: if you don't fix it, you have to document it very
thoroughly, carry it over from one version to the next, and make
sure that no changes are made to the system which might increase the
probability of key collision, etc. It's like a little stone in your
shoe.

Harald Kirsch
--
P.S.: Never ever mail me copies of your posts.
---------------------+---------------------------------------------


 
 
 

Post by Louis W » Fri, 06 Aug 1999 04:00:00


Thanks to all that replied to this thread or emailed me!

I'm pleased that the vast majority came down in favour of "party two", with a
preference for stability at a moderate cost despite the improbable
occurrence of the problem. Many gave sound reasons above and beyond simple
good programming, including "what if the customer decided to take legal
action to recover losses...?". Eek!

A couple of things entertained/concerned me:
-  IPC keys tend to cluster when assigned via ftok(). This clustering is
believed to be strongly related to the count/size of files on the system, by
way of the inode number of the target file. Consequently, if the fixed key
happens to fall into this range, repeated failures can be expected; if not,
a collision is less likely. This leads me to ask: does this tendency apply
to all mainstream Unix flavours?
-  If the restarts are scheduled to take place out of hours, the 'cost'
should also include the 'annoyance factor' of being called out in the middle
of the night. One helpful fellow pointed out that as the system would be
used in several countries, would there be a local technician available who
is aware of the problem...?  Another raised the fun question of "Do the
operators of these other systems speak the same language as your
technicians?" - the honest answer to which is "yea... and toast lands
buttered side up..."  ;-)
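
To illustrate the clustering point: the sketch below assumes the key layout that glibc's ftok() uses, as far as I'm aware (low 16 bits of the inode number, low 8 bits of the device number, 8 bits of project id). Other Unix flavours are free to build the key differently, so treat this as an assumption rather than an answer for all of them. ftok_style_key and key_in_cluster are made-up names.

```c
#include <sys/types.h>
#include <sys/ipc.h>

/* Hypothetical helper mirroring the layout assumed above:
 * bits 0-15 from the inode, bits 16-23 from the device, bits 24-31 from proj_id. */
key_t ftok_style_key(ino_t ino, dev_t dev, int proj_id)
{
    return (key_t)((ino & 0xffff)
                 | ((dev & 0xff) << 16)
                 | (((unsigned)proj_id & 0xffu) << 24));
}

/* Hypothetical check: does a fixed key share its top 16 bits with the keys
 * this layout would generate for files on `dev` with this proj_id? */
int key_in_cluster(key_t fixed, dev_t dev, int proj_id)
{
    key_t base = ftok_style_key(0, dev, proj_id);
    return ((unsigned)fixed & 0xffff0000u) == ((unsigned)base & 0xffff0000u);
}
```

Under this layout, every file on one device with one project id maps into the same block of 65536 keys, and two files whose inode numbers differ by a multiple of 65536 get the same key. A fixed key that happens to sit inside a busy block is therefore far more exposed than the "2^32 to 1" figure suggests.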

Oh, by the way, the 'real life' problem occurred twice this week in testing.
The number of cold starts of the system that took place averaged 26...
Against those odds I think I may play the lottery a bit more ;-)
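
For a sense of "those odds", here is a small C sketch of the arithmetic. It assumes the figures are taken at face value (two clashes in 26 cold starts) and that each start independently clashes with probability 2^-32, which is the model party one is relying on. The function names are hypothetical.

```c
/* x raised to a non-negative integer power, without needing libm. */
static double ipow(double x, int n)
{
    double r = 1.0;
    while (n-- > 0)
        r *= x;
    return r;
}

/* P(at least kmin successes in n independent trials, each with probability p).
 * Summed term by term, because 1 - P(0) - P(1) would lose a tiny answer
 * to floating-point rounding. */
double prob_at_least(int n, int kmin, double p)
{
    double total = 0.0;
    double choose = 1.0;                      /* C(n, 0) */

    if (kmin <= 0)
        return 1.0;
    for (int k = 1; k <= n; k++) {
        choose = choose * (n - k + 1) / k;    /* C(n, k) from C(n, k-1) */
        if (k >= kmin)
            total += choose * ipow(p, k) * ipow(1.0 - p, n - k);
    }
    return total;
}

/* Chance of two or more clashes in `starts` cold starts under the 2^-32 model. */
double odds_under_2_32_model(int starts)
{
    const double p = 1.0 / 4294967296.0;      /* 2^-32 */
    return prob_at_least(starts, 2, p);
}
```

For 26 starts the result is on the order of 1e-17, so seeing two clashes in a week says the "2^32 to 1" model does not describe this system, rather than that anyone got extraordinarily unlucky.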

Louis



 
 
 

1. Hypothetical Situation...

     Let's assume that you're working as a systems administrator or software
developer, building Win32 "solutions".  Your salary is $75,000 a year.  You
use Linux at home, but haven't found many places that will employ a Linux
developer at this time.  Blue screens and work-arounds are an every day
occurrance.

     Then, one day, a strange man shows up at your door with a black box.  He
informs you that every time you press the button on the box, somebody dies,
and you get $10,000...  Wait, wrong situation...

     Anyways, the aforementioned strange man comes to your house, and tells
you that he's interested in funding an open-source project.  He'd like you
to work on it, but can only pay you $35,000 a year.  You would, however,
have very few (if any) deadlines, would be able to choose your development
environment and tools (as long you keep portability in mind), and could
work from your home.

     The question is...  Would you take the job, and give up the big bucks
in order to hack for a living?

     --
     Michael Chisari
     Beyond The Web

2. Java Web Server 1.1

3. Hypothetical situation...

4. Patching Solaris 8 Intel fails, corrupt patch?

5. Hypothetical Situation...

6. SNMP for Linux

7. Request feedback on a PPro-200 system configuration

8. X.25 w/LINUX ?

9. Requesting feedback on technology - 1000 requests per second?

10. WWW: Request for Feedback on WWW site on Linux distributions

11. Request for Linux feedback

12. Linux Quake Howto - feedback requested

13. Embedded GUI versions - Feedback requested