Gridware difficulties and NFS difficulties (?)

Post by Charles Peterson » Sun, 02 Feb 2003 11:47:21



I'm trying to use Sun Grid Engine 5.3 (aka gridware) on a "cluster"
(ethernet LAN) of about 100 (more coming) dual processor machines
running Solaris x86 version 8.

I consistently have problems with gridware, NFS, or both, even when I
eliminate NFS use entirely as far as my application is concerned (the
gridware itself operates on an NFS-mounted partition, so no matter
what I do, NFS is never entirely out of the picture).

I'm running array jobs (e.g. qsub -t 1-1000) of between 10 and 1000
elements (ultimately I would like to scale this up to 10000 or
more...I need to know what the limits are, too).  Jobs of 10 elements
usually complete without fail.  Recently I have been unable to get
100 elements to complete successfully, though I did once get a
1000-element job to complete, so I know it's not impossible.
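For concreteness, a submission of the kind I'm describing looks
something like this (the script name is a placeholder; SGE's task
range syntax is first-last[:step]):

```shell
# submit a 1000-element array job; SGE runs run_task.sh once per
# element and sets SGE_TASK_ID in each task's environment
qsub -t 1-1000 run_task.sh
```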

On the current testbed, home directories are NFS mounted, but I've
modified my very complicated application so that it doesn't read any
files from the home directory (normally it reads a key file, startup
command file, and then searches for scripts).  The test involves only
starting the application from a locally mounted disk (full pathnames
used to prevent path searching), and writing the date to a file, which
is then copied back to the submit machine using an explicit rcp in the
script itself.
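The test job script amounts to something like this (hostnames and
paths here are hypothetical stand-ins for the real ones):

```shell
#!/bin/sh
# minimal test job: write the date to a local file, then push the
# result back to the submit machine with an explicit rcp, so the
# job itself never writes over NFS
OUT=/var/tmp/result.$SGE_TASK_ID
date > $OUT
rcp $OUT master:/export/results/result.$SGE_TASK_ID
rm -f $OUT
```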

I've worked it to two fairly consistent failure scenarios for 100+
element jobs:

1) When stdout and stderr files are allowed to be copied back to the
home directory on the "master" machine (which serves as both submit
host and NFS server for the entire "cluster"), Gridware usually chokes
on at least one of the array elements (typically 2 out of 100).  The
failed jobs appear in the QMON Job Control GUI under "pending" jobs
(i.e., they were never scheduled).  Clicking on "why" brings up a list
showing that gridware tried to run the job on every available "queue"
several times, then gave up.  This defeats the purpose: the whole
point of having an array of jobs was to have Gridware take care of
scheduling them as resources become available.  Besides, we actually
have more processors than jobs, and it still fails.  But the second
scenario suggests that the problem here may actually be "beneath"
gridware (in NFS, for example).

2) When I suppress the stdout and stderr files (putting -o /dev/null
and -e /dev/null into the qsub command), no jobs ever get stuck as
unscheduled in the "pending" list, even with 1000 elements.  But then
some jobs simply never finish, and I can't tell why, since there are
no stdout or stderr files to look at.  Because no jobs get stuck as
"unscheduled" in this case (which also reduces NFS traffic
considerably), it appears, as I said above, that the scheduling
failures in case 1 are really due to some sort of NFS problem, such
as a timeout, caused by all the stdout and stderr files being written
to the NFS-mounted home directory.
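For reference, scenario 2's submission just adds the two output
options (again, the script name is a placeholder):

```shell
# discard per-task stdout/stderr instead of writing hundreds of
# small files into the NFS-mounted home directory
qsub -t 1-100 -o /dev/null -e /dev/null run_task.sh
```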

I'm stuck between a rock and a hard place here.

One sysadmin says flatly that NFS is incompatible with parallel
processing.  If true, this is unfortunate because it's a pain to have
to write scripts to copy all data files and the scripts (and perhaps
programs) that work on them to every machine, copy the results back,
and then delete all the working files.  But this is actually what I do
in the tests above, particularly when I'm suppressing the stdout and
stderr files.  And I still get no satisfaction.

The other sysadmin who installed gridware has the gridware itself
installed on an NFS mounted partition.  I don't know if that's
necessary, I think he may have just done that for his own convenience.
 Perhaps he'll have to do it some other way in order to make this
thing work.  Maybe if we remove all NFS mounts we'll start to get some
successful results.  I guess I could tolerate that, but I still think
it stinks.  At minimum, having users' home directories NFS mounted
allows them to painlessly have interactive qrsh sessions.  Otherwise
they're forever bogged down in ensuring consistency between all their
aliases, scripts, data, etc.  I know, I've been there.  I think that's
why NFS was invented in the first place.

I've had similar results with my own distributed resource manager on
Solaris x86 last year.  I frequently found that NFS mounted files
"disappeared" when jobs tried to access them.  Even critical
directories would "disappear" randomly.  (Typical system error was
stat failure.)  I moved my application away from NFS, quit using SSH
(that was a huge source of trouble), and ultimately things got fairly
reliable: I could launch about 10000 jobs successfully, or maybe have
one mysterious error out of 10000 jobs (I had to allow for the restart
of unsuccessful jobs in my application...which also ran my resource
manager...and I was hoping not to have to do that sort of thing with
gridware).  Ironically, I never had to abandon NFS entirely, in fact I
relied on NFS to copy the results back to their intended location, and
that part seemed to work OK.  But relying on NFS to access my large
application program and its files was definitely out.  They had to be
copied to local directories on each machine.
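The staging approach that eventually worked can be sketched like this
(all names and paths are hypothetical; only the small result file
ever touches NFS):

```shell
#!/bin/sh
# stage the large application to local disk, run it locally, and
# let NFS carry back only the result
rcp master:/export/apps/myapp /var/tmp/myapp.$$
chmod +x /var/tmp/myapp.$$
/var/tmp/myapp.$$ > /nfs/results/out.`hostname`
rm -f /var/tmp/myapp.$$
```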

Now I remember back not too long ago you could pretty much rely on
NFS.  Even if the NFS server went down for a few minutes, the
application would simply wait until it came back up and continue on as
if nothing had happened.  Personally, I consider that to be the
correct behavior for NFS.  If an application has to be
restarted...that causes lots of hassle for the application program.
Interactive users can simply use CTRL-C if it bugs them, provided the
shell itself isn't NFS mounted (which I have also seen).

But Sun has deemed that the needs of interactive users are more
important, or that their method of starting programs through memory
mapping, or something, has made the old "reliable" way of doing NFS
obsolete.  NFS seems to have become a "best effort in 1 sec" sort of
system, if it can't get the file requested in fairly short order
there's a timeout, and the resulting code, even system code, may
consider this to mean that the file doesn't even exist.  For
non-interactive applications this is absolutely the wrong behavior
IMVHO.

I'm wondering whether we might be better off ditching Solaris x86 and
using NetBSD instead.  I wouldn't be surprised if NFS is still a
"reliable" system on NetBSD.  A reliable NFS system would make
everything easier.  We've been seriously considering trying NetBSD.

Charles Peterson
Lead Programmer on SOLAR

 
 
 

Gridware difficulties and NFS difficulties (?)

Post by Rich Teer » Sun, 02 Feb 2003 13:20:58



Quote:> But Sun has deemed that the needs of interactive users are more
> important, or that their method of starting programs through memory
> mapping, or something, has made the old "reliable" way of doing NFS
> obsolete.  NFS seems to have become a "best effort in 1 sec" sort of
> system, if it can't get the file requested in fairly short order
> there's a timeout, and the resulting code, even system code, may
> consider this to mean that the file doesn't even exist.  For
> non-interactive applications this is absolutely the wrong behavior
> IMVHO.

I don't know if this'll help, but what NFS share/mount options
are you using?
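On a Solaris client, the options actually in effect can be listed
with:

```shell
# print each NFS mount and its negotiated options
# (vers, proto, hard/soft, timeo, retrans, ...)
nfsstat -m
```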

--
Rich Teer

President,
Rite Online Inc.

Voice: +1 (250) 979-1638
URL: http://www.rite-online.net

 
 
 

Gridware difficulties and NFS difficulties (?)

Post by Richard L. Hamilt » Mon, 03 Feb 2003 01:21:15




Quote:> I'm trying to use Sun Grid Engine 5.3 (aka gridware) on a "cluster"
> (ethernet LAN) of about 100 (more coming) dual processor machines
> running Solaris x86 version 8.
[...]
> Now I remember back not too long ago you could pretty much rely on
> NFS.  Even if the NFS server went down for a few minutes, the
> application would simply wait until it came back up and continue on as
> if nothing had happened.  Personally, I consider that to be the
> correct behavior for NFS.  If an application has to be
> restarted...that causes lots of hassle for the application program.
> Interactive users can simply use CTRL-C if it bugs them, provided the
> shell itself isn't NFS mounted (which I have also seen).

Make sure that the mount is with the "hard" option, and absolutely NOT
with the "soft" option.  From mount_nfs(1m):

           hard | soft
                 Continue to  retry  requests  until  the  server
                 responds  (hard)  or give up and return an error
                 (soft).  The default value is hard.

For CTRL-C to work with "hard", one should also add the "intr" option.

           intr | nointr
                 Allow (do not allow) keyboard interrupts to kill
                 a  process  that  is  hung  while  waiting for a
                 response on  a  hard-mounted  file  system.  The
                 default  is  intr,  which  makes it possible for
                 clients to interrupt applications  that  may  be
                 waiting for a remote mount.

Note that both should be the default, so as long as neither "soft" nor
"nointr" are specified as mount options, that shouldn't be the problem.
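For completeness, forcing the recommended behavior explicitly looks
like this (server name and mount point are placeholders):

```shell
# hard,intr: block until the server responds, but let the user
# interrupt a hung process from the keyboard
mount -F nfs -o hard,intr server:/export/home /home
```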

--

 
 
 

Gridware difficulties and NFS difficulties (?)

Post by Thomas Deh » Mon, 03 Feb 2003 03:22:02



> I'm trying to use Sun Grid Engine 5.3 (aka gridware) on a "cluster"
> (ethernet LAN) of about 100 (more coming) dual processor machines
> running Solaris x86 version 8.

> I consistently have problems with gridware, NFS, or both, even when I
> eliminate NFS use entirely as far as my application is concerned (the
> gridware itself operates on an NFS-mounted partition, so no matter
> what I do, NFS is never entirely out of the picture).

> I'm running array jobs (e.g. qsub -t 1-1000) of between 10-1000
> elements (ultimately I would like to scale this up to 10000 or
> more...I need to know what the limits are too).  Jobs of 10 elements
> complete without fail (usually).  Recently I have been unable to get
> 100 elements to complete successfully, though I did once get a 1000
> element job to complete so I know it's not impossible.

I think that for such a huge setup I'd negotiate a grid engine
"software only" support contract and then let Sun sort it out.

Until I find somebody actually willing to sell me such a support
contract, I'd ask such questions in Sun's grid engine web forum.

http://wwws.sun.com/software/gridware/support.html

Thomas

 
 
 

Gridware difficulties and NFS difficulties (?)

Post by Philip Bro » Wed, 05 Feb 2003 08:12:45



Quote:>...
>I've had similar results with my own distributed resource manager on
>Solaris x86 last year.  I frequently found that NFS mounted files
>"disappeared" when jobs tried to access them.  Even critical
>directories would "disappear" randomly.

Sounds like the machines are not time-synced tightly enough.

Quote:> NFS seems to have become a "best effort in 1 sec" sort of
>system, if it can't get the file requested in fairly short order
>there's a timeout, and the resulting code, even system code, may
>consider this to mean that the file doesn't even exist.  For
>non-interactive applications this is absolutely the wrong behavior
>IMVHO.

This sort of thing should be tunable with nfs mount options, I would guess.

Your "new behaviour" is probably a side effect of the nfsv3/TCP transition.

But additionally, I would do throughput testing of each Solaris x86
machine, to make sure it is negotiating with the switch correctly.
Often you can have a full-duplex/half-duplex mix-up, just like on
SPARCs.  That will mess with your NFS in strange ways.
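On Solaris, the negotiated link state of an hme-style interface can
be inspected with ndd (the driver name is an assumption; it may be
eri, hme, etc. depending on the machine):

```shell
# 1 = link up / full duplex / 100 Mbit, 0 = down / half duplex / 10 Mbit
ndd -get /dev/hme link_status
ndd -get /dev/hme link_mode
ndd -get /dev/hme link_speed
```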

--
http://www.blastwave.org/ for solaris pre-packaged binaries with pkg-get
  Sign up to maintain a package for your own favourite software!
[Trim the no-bots from my address to reply to me by email!]

http://www.spamlaws.com/state/ca1.html

 
 
 

nis nfs difficulties solaris 8

Hi:
     I am working on setting up some new computers for our research
lab, and have run into some problems - for some reason, the computers
I am adding cannot (a) log in via NIS or (b) get past the default router.
     Here's what our lab consists of:
          6 Sun Blade 100's running Solaris 8
               (lucy, edmund, susan, tumnus, aslan, peter)
          1 fileserver running Suse Linux
               (wardrobe)
          + 2 new Sun Blade 150's running Solaris 8
               (whitewitch, mrbeaver)

The 6 SB100's and the fileserver work fine, it's just the two new
computers that are having problems. The login system is supposed to be
NIS, and when I run ypwhich, it reports the hostname of the fileserver
(wardrobe). The /etc/hosts file looks just like the other 6 computers,
listing localhost, thiscomputer, and wardrobe. The /etc/hostname and
/etc/hostname.eri0 files are set to the correct host.

As far as the default router issue goes, the /etc/defaultrouter is set
to 10.22.0.254, just like the other computers. Running "route get
default" reports the router (10.22.0.254), same as other computers.

The only other thing that I notice is that the other computers can
resolve the hostnames of other machines in the lab (lucy, edmund,
peter...), but the new machines can only resolve what is listed in
/etc/hosts.
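One thing I plan to check on the new machines is whether the hosts
entry in /etc/nsswitch.conf includes nis at all, since a files-only
entry would produce exactly this symptom:

```shell
# on the working machines this line typically reads "hosts: files nis"
grep '^hosts' /etc/nsswitch.conf
# verify that NIS itself can resolve a lab hostname
ypmatch lucy hosts
```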

Any ideas? I'm really stuck. Thanks a lot!
