I'm trying to use Sun Grid Engine 5.3 (aka gridware) on a "cluster"
(Ethernet LAN) of about 100 (more coming) dual-processor machines
running Solaris x86 version 8.
I consistently have problems with gridware, NFS, or both, even when I
try to eliminate NFS entirely as far as my application is concerned
(the gridware itself operates on an NFS-mounted partition, so no
matter what I do, NFS is never completely out of the picture).
I'm running array jobs (e.g. qsub -t 1:1000) of between 10 and 1000
elements (ultimately I would like to scale this up to 10000 or
more...I need to know what the limits are too). Jobs of 10 elements
usually complete without a hitch. Recently I have been unable to get
100 elements to complete successfully, though I did once get a
1000-element job to complete, so I know it's not impossible.
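For concreteness, a submission looks roughly like this (the script
name and data paths are made up, and each element picks up its index
from $SGE_TASK_ID, if I have the variable name right):

    #!/bin/sh
    # myjob.sh -- one array element; $SGE_TASK_ID selects the input
    /export/local/apps/myapp /export/local/data/input.$SGE_TASK_ID

    # submitted as a single array job of 1000 elements:
    qsub -t 1:1000 myjob.sh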
On the current testbed, home directories are NFS-mounted, but I've
modified my very complicated application so that it doesn't read any
files from the home directory (normally it reads a key file and a
startup command file, and then searches for scripts). The test involves only
starting the application from a locally mounted disk (full pathnames
used to prevent path searching), and writing the date to a file, which
is then copied back to the submit machine using an explicit rcp in the
script itself.
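The test job script itself is essentially this (paths and the submit
host name are placeholders):

    #!/bin/sh
    # test.sh -- minimal test: no home directory, no PATH searching
    OUT=/tmp/datefile.$SGE_TASK_ID
    # application started by full pathname from a local disk
    /export/local/apps/myapp
    # write the date to a local file as evidence the element ran
    date > $OUT
    # explicit rcp back to the submit machine, then clean up
    rcp $OUT submithost:/export/results/
    rm -f $OUT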
I've narrowed it down to two fairly consistent failure scenarios for
100+ element jobs:
1) When I allow the stdout and stderr files to be copied back to the
home directory on the "master" machine (which serves as both submit
host and NFS host for the entire "cluster"), gridware usually chokes
on at least one of the array elements (typically 2 out of 100). The
failed jobs appear in the QMON Job Control GUI under "pending" jobs
(i.e., they were never scheduled). Clicking on "why" brings up a list
showing that gridware tried to run the job on every available "queue"
several times and then gave up. This is stupid because the whole
point of an array job was to have gridware schedule the elements as
resources become available. Besides, we actually have more processors
than jobs, and it still fails. But the second scenario suggests that
the problem here may actually be "beneath" gridware (in NFS, for
example).
2) When I suppress the stdout and stderr files (by putting -o
/dev/null and -e /dev/null into the qsub command), no jobs ever get stuck as
unscheduled in the "pending" list, even with 1000 elements. But then
some jobs simply never finish, and I can't tell why, since there are
no stdout or stderr files to look at. But because no jobs ever get
stuck as "unscheduled" in this case (which reduces NFS traffic
considerably), it appears that the scheduling failures in case 1
really come from some sort of NFS problem, such as a timeout, caused
by all the stdout and stderr files being written to the NFS-mounted
home directory.
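One thing I may try next is pointing the output files somewhere local
on each execution host instead of throwing them away, so the NFS
traffic goes away but there is still something to read afterward.
Something like this, assuming -o and -e will accept a directory and
that /tmp exists on every node:

    # stdout/stderr land in /tmp on whichever host runs the element,
    # instead of the NFS-mounted home directory or /dev/null
    qsub -t 1:1000 -o /tmp -e /tmp myjob.sh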
I'm stuck between a rock and a hard place here.
One sysadmin says flatly that NFS is incompatible with parallel
processing. If true, this is unfortunate because it's a pain to have
to write scripts to copy all data files and the scripts (and perhaps
programs) that work on them to every machine, copy the results back,
and then delete all the working files. But this is actually what I do
in the tests above, particularly when I'm suppressing the stdout and
stderr files. And I still get no satisfaction.
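The staging dance I'm describing looks roughly like this (all paths
and host names invented; $JOB_ID and $SGE_TASK_ID are what I believe
gridware sets in the job's environment):

    #!/bin/sh
    # copy in, run, copy out, clean up -- all on local disk
    WORK=/var/tmp/work.$JOB_ID.$SGE_TASK_ID
    mkdir -p $WORK
    rcp submithost:/export/data/input.$SGE_TASK_ID $WORK/
    cd $WORK
    /export/local/apps/myapp input.$SGE_TASK_ID > result.$SGE_TASK_ID
    rcp result.$SGE_TASK_ID submithost:/export/results/
    cd /
    rm -rf $WORK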
The other sysadmin who installed gridware has the gridware itself
installed on an NFS-mounted partition. I don't know if that's
necessary; I think he may have just done that for his own convenience.
Perhaps he'll have to do it some other way in order to make this
thing work. Maybe if we remove all NFS mounts we'll start to get some
successful results. I guess I could tolerate that, but I still think
it stinks. At a minimum, having users' home directories NFS-mounted
lets them run interactive qrsh sessions painlessly. Otherwise
they're forever bogged down in ensuring consistency between all their
aliases, scripts, data, etc. I know, I've been there. I think that's
why NFS was invented in the first place.
I had similar results with my own distributed resource manager on
Solaris x86 last year. I frequently found that NFS-mounted files
"disappeared" when jobs tried to access them. Even critical
directories would "disappear" randomly. (Typical system error was
stat failure.) I moved my application away from NFS, quit using SSH
(that was a huge source of trouble), and ultimately things got fairly
reliable: I could launch about 10000 jobs successfully, or maybe see
one mysterious error out of 10000 (I had to allow for restarting
unsuccessful jobs in my application, which also ran my resource
manager, and I was hoping not to have to do that sort of thing with
gridware). Ironically, I never had to abandon NFS entirely; in fact I
relied on NFS to copy the results back to their intended location, and
that part seemed to work OK. But relying on NFS to access my large
application program and its files was definitely out. They had to be
copied to local directories on each machine.
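The kind of workaround those stat failures forced on me was a crude
retry loop before touching anything on NFS, roughly like this (the
interval and retry count are arbitrary):

    #!/bin/sh
    # wait_for_file.sh -- retry until an NFS path is visible, or give up
    FILE=$1
    tries=0
    while [ ! -r "$FILE" ] && [ $tries -lt 10 ]; do
        sleep 5
        tries=`expr $tries + 1`
    done
    [ -r "$FILE" ] || { echo "giving up on $FILE" >&2; exit 1; }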
Now I remember back not too long ago you could pretty much rely on
NFS. Even if the NFS server went down for a few minutes, the
application would simply wait until it came back up and continue on as
if nothing had happened. Personally, I consider that to be the
correct behavior for NFS. Having to restart an application instead
causes a lot of hassle. Interactive users who don't want to wait can
simply hit CTRL-C, provided the shell itself isn't NFS-mounted (which
I have also seen).
But Sun has apparently deemed that the needs of interactive users are
more important, or that their method of starting programs through
memory mapping, or something, has made the old "reliable" way of
doing NFS obsolete. NFS seems to have become a "best effort in 1 sec"
sort of system: if it can't get the requested file in fairly short
order there's a timeout, and the calling code, even system code, may
take this to mean that the file doesn't exist at all. For
non-interactive applications this is absolutely the wrong behavior
IMVHO.
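I believe what I'm calling the "correct" behavior is what a hard NFS
mount gives you, and the "best effort with a timeout" behavior is
what a soft mount gives you, so part of the answer may just be the
mount options in use. Solaris syntax, from memory, with made-up
server and paths:

    # hard mount: the client blocks and retries forever if the server
    # goes away; "intr" lets an interactive user break out with CTRL-C
    mount -F nfs -o hard,intr server:/export/home /home

    # soft mount: calls time out and return errors to the application
    mount -F nfs -o soft,timeo=10 server:/export/home /home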
I'm wondering whether we might be better off ditching Solaris x86 and
using NetBSD instead. I wouldn't be surprised if NFS is still a
"reliable" system on NetBSD, and a reliable NFS would make everything
easier. We've been seriously considering trying it.
Charles Peterson
Lead Programmer on SOLAR