> I'd like to hear some of the threads community's thoughts on implementing
> robust mutexes in shared memory. I've read Dave's comments in the FAQ (Q88)
> http://www.lambdacs.com/FAQ.html#Q88
> but I don't want to give up on this problem (he says it's "impossible").
> I'm not proposing a solution (I don't have one), but I would like to put
> forth some thoughts on the problem in hopes someone else may have a
> solution.
Nothing's "impossible", but I've gotten tired of explaining all the nuances. To
recover from a failure in this situation you need ABSOLUTE knowledge of what
each component was doing. You need to be able to "retire" the locked mutex,
making all active threads switch to a new mutex. You need to be able to
determine the proper value of each byte of storage that the failing thread
might possibly have touched, and restore it, while preventing any active thread
from using that suspect data. This is a VERY tall order. Impossible? No, not
quite. At least, not in an embedded system where you're not using a kernel or
any library code over which you don't have total and absolute control.
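Just to give a flavor of what "retiring" a locked mutex even means, here's a rough, untested sketch of one tiny corner of it: a pool of process-shared mutexes plus an index a recovery manager can advance. The names are made up, the index uses C11 atomics purely to keep the sketch short, and it completely ignores threads already blocked inside the abandoned mutex, the window during the switch where an old-lock holder and a new-lock holder can overlap, and the shared data itself -- which is most of the real problem.

    #include <pthread.h>
    #include <stdatomic.h>

    #define LOCK_POOL 4

    /* Lives in shared memory (illustrative layout only). */
    struct retirable_lock {
        pthread_mutex_t pool[LOCK_POOL];
        atomic_int      current;        /* index of the mutex now in use */
    };

    /* One-time setup: every mutex in the pool is created process-shared,
     * so "retiring" later is just moving the index, never re-initializing
     * a mutex a dead thread might still "own". */
    static int rl_init(struct retirable_lock *rl)
    {
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        for (int i = 0; i < LOCK_POOL; i++)
            if (pthread_mutex_init(&rl->pool[i], &attr) != 0)
                return -1;
        pthread_mutexattr_destroy(&attr);
        atomic_store(&rl->current, 0);
        return 0;
    }

    /* Lock whichever mutex is current, re-checking after we acquire it in
     * case the manager retired it while we were blocked; return the index
     * we hold so the caller unlocks the same one. */
    static int rl_lock(struct retirable_lock *rl)
    {
        for (;;) {
            int idx = atomic_load(&rl->current);
            pthread_mutex_lock(&rl->pool[idx]);
            if (idx == atomic_load(&rl->current))
                return idx;
            pthread_mutex_unlock(&rl->pool[idx]); /* retired under us; retry */
        }
    }

    /* Recovery manager: abandon the wedged mutex. Anyone already blocked
     * inside it stays blocked -- one of the reasons this is a tall order. */
    static void rl_retire(struct retirable_lock *rl)
    {
        atomic_store(&rl->current,
                     (atomic_load(&rl->current) + 1) % LOCK_POOL);
    }

Even getting this toy version airtight takes more care than it shows; it's here only to make "retire the mutex" concrete, not to claim it solves anything.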
> The main issue I'm concerned with, though, is the possibility that a process
> could die while holding a mutex lock. Dave says you'd have to write your
> own implementation of a read lock to allow a third party process to monitor
> processes claiming to have the lock. But at the end of the day, you're
> going to have to use a pthread primitive somewhere. So the question is,
> how do you unlock a mutex when the process that locked it dies with it
> locked? Overwrite it with zeros?
Now THAT'S impossible. (At least, illegal, extremely dangerous, and completely
non-portable -- and unlikely to work on many, if any, implementations.) Once
you've got a dead thread with a mutex, forget about prying the mutex from its
fingers. The mutex is gone, poof. If your application can't be made to deal
with that, forget the mutex. A semaphore is a perfectly reasonable alternative
-- anyone can "unlock" a semaphore.
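For the record, that part really is routine: a process-shared POSIX semaphore in a mapped segment, which any process holding the mapping can post. Sketch only -- the shm name is made up and error handling is thin.

    #include <semaphore.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    sem_t *setup_lock(void)
    {
        /* "/robust_demo" is just an illustrative name. */
        int fd = shm_open("/robust_demo", O_CREAT | O_RDWR, 0600);
        if (fd < 0 || ftruncate(fd, sizeof(sem_t)) < 0)
            return NULL;
        sem_t *s = mmap(NULL, sizeof(sem_t), PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
        if (s == MAP_FAILED)
            return NULL;
        /* pshared=1: visible to any process that maps the segment;
         * initial value 1 makes it behave like an unlocked "mutex". */
        sem_init(s, 1, 1);
        return s;
    }

    void take_lock(sem_t *s) { sem_wait(s); }
    void drop_lock(sem_t *s) { sem_post(s); }  /* works from ANY process */

Nothing ties the sem_post to the process that did the sem_wait, which is exactly why a third party can "unlock" it -- and also exactly why it doesn't solve the interesting problem.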
> Maybe you could use a semaphore with a maximum count of one. It would seem
> those are easier to steal. Once a determination was made that the owner
> process is no longer around (see below), then the babysitter process or
> thread making this determination could decrement the semaphore on the
> deceased's behalf.
You can't "steal" a semaphore, because they're designed to be open and sharing
and all that. Yeah, anyone can post a semaphore for which you've got an
address. No big deal, and completely routine. But that's still down in the
"trivial" part. The hard part is dealing with the rest of the shared data --
NOT recovering from the synchronization failure. The thread DIED. It did
something you didn't expect. (And if anyone really did design a thread to
deliberately terminate while holding synchronization resources, I would still
consider anything it does extremely suspect!) Now you're trying to guess what
consequences that might have to your application. As I said, unless you can
examine and validate every byte of shared data, forget it. The iceberg won, and
your boat has sunk. Deal with it.
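If you want the babysitter to at least know WHETHER the data is suspect, the owner has to tell it, in advance, every single time it touches the data. Something like this illustrative sketch (no memory-ordering care taken, names made up):

    #include <semaphore.h>

    /* Illustrative shared segment: the semaphore "lock", a busy flag the
     * owner raises before touching the data, and the data itself. */
    struct shared_seg {
        sem_t        lock;      /* initialized with sem_init(&lock, 1, 1) */
        volatile int busy;      /* nonzero while an update is in progress */
        int          data[64];  /* whatever the application shares */
    };

    /* Owner side: announce the update before making it. */
    void update(struct shared_seg *s, int i, int v)
    {
        sem_wait(&s->lock);
        s->busy = 1;            /* if we die past this point, data is suspect */
        s->data[i] = v;
        s->busy = 0;
        sem_post(&s->lock);
    }

    /* Babysitter side, after deciding the owner is dead: it can recover
     * the LOCK easily (sem_post), but "busy" only tells it the data MAY
     * be half-written -- it still can't reconstruct the correct bytes. */
    int recover(struct shared_seg *s)
    {
        int suspect = s->busy;
        s->busy = 0;
        sem_post(&s->lock);
        return suspect;         /* caller must decide what "suspect" means */
    }

The flag buys you detection, not repair, and that's the point.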
> Then there's the issue of determining whether a process claiming ownership
> of the lock really has it. This one is tough. To get anywhere, you'd at
> least need a PID that you can try to kill(2) with "0" to test the process's
> existence, but how do you protect the four bytes of PID from being read
> while they were in the process of being written? And what happens if the
> process dies after it gets the lock, but before it writes its PID there?
> Then, if you could "kill" the other process, how can you be sure it's not
> someone's shell or something that happened to reuse the same PID as the
> process which may have died a month ago?
OK, if you're talking about two PROCESSES using shared memory, you might get
somewhere with this. But it's much more fun to consider two THREADED processes
using shared memory. You cannot validate (or see) a thread ID within another
process. Threads within the process that don't even KNOW about the shared
memory can corrupt the data just as well as threads that know about it. Now
isn't that a lot more fun?
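For the two-process case, the kill(2)-with-0 probe described above looks roughly like this (sketch; owner_seems_alive is a made-up name):

    #include <sys/types.h>
    #include <signal.h>
    #include <errno.h>

    /* Best-effort check of whether the process that wrote its PID into the
     * shared lock record still exists. kill() with signal 0 delivers
     * nothing; it just performs the existence and permission checks. */
    int owner_seems_alive(pid_t owner)
    {
        if (kill(owner, 0) == 0)
            return 1;            /* a process with that PID exists...      */
        if (errno == EPERM)
            return 1;            /* ...exists, but isn't ours to signal    */
        return 0;                /* ESRCH: no such process (right now)     */
    }

Note the asymmetry: ESRCH proves the recorded owner is gone, but success only proves that SOME process has that PID today -- which is exactly the reuse problem raised above, and it says nothing about the threads inside that process.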
> I'm going to do some investigation into all the things in Unix that are
> totally fail-safe and try to combine them in such a way as to provide robust
> shared mutexes. For instance, a record lock on a file will _always_ be
> removed when the locking process dies. That's the sort of thing I'm looking
> for ways to take advantage of.
There are some things, like the example you pointed out, which are
"reasonably safe under normal circumstances". But kernels aren't always exactly
bug-free, either, you know, and "totally fail-safe" is a pretty strong term.
Would you bet your life on it, literally? No, I didn't think so.
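To be concrete about the record-lock idea, it looks roughly like this (sketch; the path is made up). Remember that fcntl locks are per-PROCESS, so this says nothing about the threads inside each process, and nothing at all about the shared data.

    #include <fcntl.h>
    #include <unistd.h>

    /* Take an exclusive record lock on a well-known lock file. The kernel
     * drops this lock automatically when the process exits OR when it
     * closes any descriptor on the file -- which is both the feature being
     * looked for and one of fcntl locking's famous traps. */
    int take_file_lock(const char *path)
    {
        int fd = open(path, O_RDWR | O_CREAT, 0600);
        if (fd < 0)
            return -1;

        struct flock fl;
        fl.l_type   = F_WRLCK;   /* exclusive */
        fl.l_whence = SEEK_SET;
        fl.l_start  = 0;
        fl.l_len    = 0;         /* whole file */

        if (fcntl(fd, F_SETLKW, &fl) < 0) {  /* block until we get it */
            close(fd);
            return -1;
        }
        return fd;               /* hold fd open for the life of the lock */
    }
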
If what you want is "less than absolutely 100% guaranteed totally fail-safe",
then where's your dividing line? Maybe a mutex and shared memory is "good
enough", because the chances of one process blowing up with a mutex held are
"manageably low" in the context of immediate concern. Maybe your
semaphore/manager idea is "good enough". There's no one absolute answer for all
possible circumstances.
> Any ideas?
Yeah. Give up. ;-)
Seriously, if what you want is "reasonably fail-safe" communication between
parallel entities, forget about the performance advantages of shared memory and
mutexes. They're in a different world, and you can't live in both. Stick with
separate address spaces, and pass messages between them. You can validate the
messages, and ignore bad data. If one partner dies, you can detect that and
recover, with no chance of corruption to the surviving partners. (Unless, of
course, the partners are running the same code, in which case they have the
same bug just waiting for the right time to visit.)
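A rough sketch of the kind of thing I mean, over a pipe or socket between the two processes (illustrative message layout; partial reads ignored):

    #include <unistd.h>

    /* One fixed-size, self-describing message. The receiver validates it
     * and simply drops anything that doesn't check out. */
    struct msg {
        unsigned magic;            /* did this even come from our peer?   */
        unsigned seq;              /* detect gaps                         */
        int      payload;
        unsigned checksum;         /* magic ^ seq ^ payload, say          */
    };

    #define MSG_MAGIC 0x6d736721u

    /* Returns 1 for a good message, 0 for garbage, -1 if the peer is gone.
     * The pipe (or socket) gives us death detection for free: read()
     * returns 0 when the other end has closed, including by dying. */
    int recv_msg(int fd, struct msg *m)
    {
        ssize_t n = read(fd, m, sizeof *m);
        if (n == 0)
            return -1;             /* peer died or closed: go recover     */
        if (n != (ssize_t)sizeof *m || m->magic != MSG_MAGIC ||
            m->checksum != (m->magic ^ m->seq ^ (unsigned)m->payload))
            return 0;              /* bad data: ignore it                 */
        return 1;
    }

The survivor's own state is never exposed to the dead partner's mistakes; the worst a corrupt peer can do is send garbage, and garbage you can throw away.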
/---------------------------[ Dave Butenhof ]--------------------------\
| 110 Spit Brook Rd ZKO2-3/Q18 http://members.aol.com/drbutenhof |
| Nashua NH 03062-2698 http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-----------------[ Better Living Through Concurrency ]----------------/