Here's a hypothetical situation, extrapolated from a real-life event, on
which I would like some feedback:
Two parties are in dispute over the allocation of IPC keys for shared
memory, message queues and semaphores (shmget, msgget, semget) on a mission
critical system currently under development for a customer.
Party one holds the view that using fixed key values is safe because the
odds against another process (randomly) selecting the same key are
"2^32 to 1". The current implementation uses this mechanism.
Party two holds the view that the keys should be allocated by using the
library function ftok() (or any similar API) to ensure that the key is not in use
by any other process at the time of the call, reducing the possibility of a
collision to an absolute minimum, but accepts that the alteration to the
code would require approximately 30 man-hours for change and additional
Party one claims, to the best of their knowledge, that such a key collision
has not occurred in the past seven years in the system's predecessors which
have used the fixed-key mechanism. However, no proof can be supplied.
Party two claims they can produce evidence, by way of testimony from other
developers, indicating that the collision may have occurred on several
occasions over the last month during unit testing. Physical proof is
currently unavailable due to such faults not being logged.
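Since the lack of logging is what leaves both claims unproven, here is a
minimal sketch of how a fixed-key implementation could at least detect and
record a collision at initialisation. It assumes shared memory only; the
key value APP_SHM_KEY and the function name create_segment_logged are made
up for illustration and are not taken from the system under discussion.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Hypothetical fixed key, standing in for whatever value is hard-coded. */
#define APP_SHM_KEY ((key_t)0x4D595348)

/*
 * Create a new segment for `key`, refusing to reuse an existing one.
 * Returns the segment id, or -1 with errno set. A key that is already
 * in use shows up as errno == EEXIST and is reported on stderr.
 */
int create_segment_logged(key_t key, size_t size)
{
    /* IPC_EXCL makes shmget() fail instead of silently attaching to a
     * segment that some other process created under the same key. */
    int id = shmget(key, size, IPC_CREAT | IPC_EXCL | 0600);
    if (id == -1) {
        int saved = errno;  /* fprintf() may overwrite errno */
        if (saved == EEXIST)
            fprintf(stderr, "IPC key collision: key 0x%lx already in use\n",
                    (unsigned long)key);
        else
            fprintf(stderr, "shmget(0x%lx) failed: %s\n",
                    (unsigned long)key, strerror(saved));
        errno = saved;
    }
    return id;
}
```

Note that EEXIST alone does not say who owns the segment: a leftover from
an earlier run of the same system produces the same error as a clash with
an unrelated process.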
Points to note:
1> The system is designed to run 24/7 but may be reinitialised (i.e. cold
started) once a quarter in most instances and once a week in others.
2> The collision would only occur during initialisation of the system.
3> The collision will cause the system to fail, resulting in an unstable
situation where the system is unusable until human intervention alters the
fixed key and reinitialises the system and optionally restarts the process
with which the collision occurred (which may come from a third party vendor
and be unrelated to the system).
4> Should the system fail during a critical period, occurring once a
calendar month for a period of up to 48 hours, the loss of service to the
customer would have a significant, albeit brief, impact.
5> The manual identification and recovery process takes roughly 10 minutes -
once a technician becomes available.
6> The customer is a major multinational and will be using the system across
dozens of platforms in a number of countries.
7> The work, should it be authorised, would be assigned to party one, the
party that wrote the fixed-key implementation.
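For comparison, a rough sketch of what party two's approach might look like
for shared memory. One caveat: ftok() only derives a key from a path and a
project id; it does not check whether that key is already in use. The
sketch therefore pairs it with IPC_EXCL and tries the next project id on a
collision, which would replace the manual "alter the key and reinitialise"
step in point 3. The function name create_segment_with_ftok, its
parameters and the retry policy are assumptions for illustration only.

```c
#include <errno.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/*
 * Derive a key from `path` with ftok(), trying project ids 1..max_tries
 * (capped at 255) until shmget() can create a fresh segment.
 * `path` must name an existing file, ideally one owned by the application.
 * Returns the segment id and stores the chosen key in *key_out,
 * or returns -1 with errno set.
 */
int create_segment_with_ftok(const char *path, size_t size,
                             int max_tries, key_t *key_out)
{
    for (int proj_id = 1; proj_id <= max_tries && proj_id <= 255; proj_id++) {
        key_t key = ftok(path, proj_id);
        if (key == (key_t)-1)
            return -1;          /* path missing or inaccessible */

        int id = shmget(key, size, IPC_CREAT | IPC_EXCL | 0600);
        if (id != -1) {
            *key_out = key;
            return id;
        }
        if (errno != EEXIST)
            return -1;          /* a real error, not a collision */
        /* EEXIST: this key is taken, try the next project id. */
    }
    errno = EEXIST;             /* every candidate key was already in use */
    return -1;
}
```

The cost of this design is that cooperating processes can no longer assume
the key; they have to be told which one was chosen (for example via a file
or the environment), and a stale segment left by a previous run will also
push the system on to the next project id unless it is cleaned up first.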
Which party do you feel has the stronger case?
I await responses with interest ;-)