LIGO Support Ticket 1658

Ticket Information
  Number:      support 1658
  User:        anderson@ligo.caltech.edu
  Email:       espinoza_e__AT__ligo.caltech.edu,duncan__AT__gravity.phys.uwm.edu,dbrown__AT__ligo.caltech.edu
  Status:      resolved
  Assigned To: wenger
Date: Tue, 25 Jul 2006 15:17:29 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>,         Brown Duncan
 <duncan__AT__gravity.phys.uwm.edu>
Subject: LIGO problem with condor_run 6.8.0

The LIGO Caltech cluster was upgraded to 6.8.0 today and we now have the
following fatal problem with condor_run:

$ condor_run ls
Condor does not have write permission to this directory.

This is not a unix file permission problem.
What is happening is that the condor_run perl script is picking up the
following warning message from condor_submit:

$ condor_submit .condor_submit.30102 
Submitting job(s)
WARNING: Log file /archive/home/anderson/.condor_log.30102 is on NFS.
This could cause log file corruption and is _not_ recommended.
.
Logging submit event(s).
1 job(s) submitted to cluster 5883752.


and over-interpreting this warning as a fatal error in the following lines
of code:


# submit the job; $cluster contains cluster number if successful
open(SUBMIT, "condor_submit .condor_submit.$$ 2>&1 |") ||
    &abort("Failed to run condor_submit.  Please check your path.\n");
while(<SUBMIT>) {
    if (/^1 job\(s\) submitted to cluster (\d+)./) {
        ($cluster) = $1;
    } elsif (/WARNING/) {
       &abort("Condor does not have write permission to this directory.\n");
    } else {
        $submit_errors .= $_;
    }
}


furthermore, even if this should be a fatal condition it is printing
the wrong and misleading error message to the users.


Commenting out the "elsif WARNING/abort" lines in condor_run lets condor_run
work again from NFS mounted home directories.


I believe it was LIGO that asked for this NFS warning message, and that it
be escalated to a fatal error condition in a future version of Condor,
however, I thought the concern (and hence the warning/fatal error) only
applied to dagman?

Is there an NFS file locking problem with all condor log files?

At any rate, the conversion of all WARNING messages to abort(...write permission)
should probably be fixed.
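
A minimal sketch of one way the loop could separate warnings from real
failures follows. It is an illustration, not the actual condor_run fix: the
helper name parse_submit_output is hypothetical, and the sample lines are the
condor_submit output quoted above rather than the result of a live call.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Classify condor_submit output lines (hypothetical helper).
# Returns the cluster number (undef if none was seen), the collected
# WARNING lines, and all remaining output.
sub parse_submit_output {
    my @lines = @_;
    my ($cluster, $warnings, $other) = (undef, '', '');
    foreach my $line (@lines) {
        if ($line =~ /^1 job\(s\) submitted to cluster (\d+)\./) {
            $cluster = $1;
        } elsif ($line =~ /WARNING/) {
            # Keep the warning text so it can be shown to the user as-is.
            $warnings .= $line;
        } else {
            $other .= $line;
        }
    }
    return ($cluster, $warnings, $other);
}

# Sample input: the output quoted earlier in this message.
my @sample = (
    "Submitting job(s)\n",
    "WARNING: Log file /archive/home/anderson/.condor_log.30102 is on NFS.\n",
    "This could cause log file corruption and is _not_ recommended.\n",
    ".\n",
    "Logging submit event(s).\n",
    "1 job(s) submitted to cluster 5883752.\n",
);

my ($cluster, $warnings, $other) = parse_submit_output(@sample);

# Pass warnings along unchanged instead of aborting with an unrelated message.
print STDERR $warnings if $warnings ne '';

# Only a missing cluster number is treated as a failed submission.
defined $cluster
    or die "condor_submit failed:\n$other";
print "submitted to cluster $cluster\n";
```

In this sketch only the line containing WARNING is captured as a warning; the
continuation line ("This could cause ...") ends up in the remaining output,
so a real fix would probably need a more careful rule for multi-line warnings.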

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date of creation: Tue Jul 25 17:25:16 2006 (1153866318)
Subject: Actions

Assigned to wenger by wenger
===========================================================================
Date of actions: Wed Jul 26  9:02:00 2006 (1153922520)
Date: Wed, 26 Jul 2006 09:34:26 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: wenger <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1658] LIGO problem with condor_run 6.8.0

Stuart,

> The LIGO Caltech cluster was upgraded to 6.8.0 today and we now have the
> following fatal problem with condor_run:
>
> $ condor_run ls
> Condor does not have write permission to this directory.
>
> This is not a unix file permission problem.
> What is happening is that the condor_run perl script is picking up the
> following warning message from condor_submit:
>
> $ condor_submit .condor_submit.30102
> Submitting job(s)
> WARNING: Log file /archive/home/anderson/.condor_log.30102 is on NFS.
> This could cause log file corruption and is _not_ recommended.
> .
> Logging submit event(s).
> 1 job(s) submitted to cluster 5883752.
>
>
> and over-interpreting this warning as a fatal error in the following lines
> of code:
>
>
> # submit the job; $cluster contains cluster number if successful
> open(SUBMIT, "condor_submit .condor_submit.$$ 2>&1 |") ||
>     &abort("Failed to run condor_submit.  Please check your path.\n");
> while(<SUBMIT>) {
>     if (/^1 job\(s\) submitted to cluster (\d+)./) {
>         ($cluster) = $1;
>     } elsif (/WARNING/) {
>        &abort("Condor does not have write permission to this directory.\n");
>     } else {
>         $submit_errors .= $_;
>     }
> }
>
>
> furthermore, even if this should be a fatal condition it is printing
> the wrong and misleading error message to the users.
>
>
> Commenting out the "elsif WARNING/abort" lines in condor_run lets condor_run
> work again from NFS mounted home directories.
>
>
> I believe it was LIGO that asked for this NFS warning message, and that it
> be escalated to a fatal error condition in a future version of Condor,
> however, I thought the concern (and hence the warning/fatal error) only
> applied to dagman?
>
> Is there an NFS file locking problem with all condor log files?

Yes, it's just that the errors don't always show up unless something is
trying to read the log file(s), and DAGMan is pretty much the only thing
(that we know of, anyhow) that tries to read the log files.

> At any rate, the conversion of all WARNING messages to abort(...write permission)
> should probably be fixed.

Okay, it sounds like there are several issues here:

1. Should a plain condor_submit generate this warning?
2. If condor_submit does generate a warning, condor_run should recognize
   that it's a warning, not a fatal error.
3. If condor_submit does generate a warning, condor_run should pass along
   the warning message instead of printing a misleading message.
4. Such a warning from condor_submit will goof up DAGMan.
5. The check is not yet in DAGMan itself.

#1: I think it *does* make sense for condor_submit to generate the
warning. I tend to think that, for a plain condor_submit, it shouldn't be
a fatal error, though (assuming we have DAGMan generate a fatal error in
this case).

At this point, there's no way to turn off the warning from condor_submit,
which is probably not good.

#2 and #3: Condor_run obviously has to be changed to deal with warnings
from condor_submit better.

#4 is kind of good and bad -- it does mean that DAGMan should fail if
you have log files on NFS.  But instead of getting the failure right
at the beginning when it checks the log files, you'll get the failure
when DAGMan tries to submit the offending job.  The other problem is that
if there's some *other* warning from condor_submit that we don't want to
be fatal, it will still goof things up.  We already have a PR (711) on
this one.

#5: this should be fixed within a day or two.

Let me know if you disagree with any of the above...

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Wed Jul 26  9:34:33 2006 (1153924474)
Date: Wed, 26 Jul 2006 14:54:02 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: wenger <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1658] LIGO problem with condor_run 6.8.0

Stuart,

It turns out that I was wrong in part of my earlier answer -- getting the
NFS warning from condor_submit will actually *not* goof up DAGMan.  There
was a PR for this, but it turns out that DAGMan continues after the
condor_submit warning.

The bad thing is that DAGMan actually hides the warning from condor_submit
-- you never even see it in the dagman.out file.

Anyhow, I'm working on fixing this, and I will let you know when the
DAGMan-level check for NFS log files is in place, too.

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Wed Jul 26 14:54:35 2006 (1153943675)
Date: Wed, 26 Jul 2006 21:52:04 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza_e__AT__ligo.caltech.edu, duncan__AT__gravity.phys.uwm.edu
Subject: Re: [condor-support #1658] LIGO problem with condor_run 6.8.0

On Wed, Jul 26, 2006 at 09:34:33AM -0600, condor-support response tracking system wrote:
> >
> > I believe it was LIGO that asked for this NFS warning message, and that it
> > be escalated to a fatal error condition in a future version of Condor,
> > however, I thought the concern (and hence the warning/fatal error) only
> > applied to dagman?
> >
> > Is there an NFS file locking problem with all condor log files?
> 
> Yes, it's just that the errors don't always show up unless something is
> trying to read the log file(s), and DAGMan is pretty much the only thing
> (that we know of, anyhow) that tries to read the log files.

Sorry to be pedantic, but are you saying you have seen problems with
non-dagman Condor jobs that write their logs to NFS mounts even when
there are _no_ processes trying to read the log file?

> 
> > At any rate, the conversion of all WARNING messages to abort(...write permission)
> > should probably be fixed.
> 
> Okay, it sounds like there are several issues here:
> 
> 1. Should a plain condor_submit generate this warning?
> 2. If condor_submit does generate a warning, condor_run should recognize
>    that it's a warning, not a fatal error.
> 3. If condor_submit does generate a warning, condor_run should pass along
>    the warning message instead of printing a misleading message.
> 4. Such a warning from condor_submit will goof up DAGMan.
> 5. The check is not yet in DAGMan itself.
> 
> #1: I think it *does* make sense for condor_submit to generate the
> warning. I tend to think that, for a plain condor_submit, it shouldn't be
> a fatal error, though (assuming we have DAGMan generate a fatal error in
> this case).

I agree on both counts. Is the DAGMan work to make this optionally a
fatal error going to be allowed in the stable 6.8.x branch?

> 
> At this point, there's no way to turn off the warning from condor_submit,
> which is probably not good.

If there are NFS client/server implementations out there that have been
demonstrated to robustly perform the file locking that Condor needs then
I agree there should be another knob to disable this warning.

> 
> #2 and #3: Condor_run obviously has to be changed to deal with warnings
> from condor_submit better.

Agreed.

> 
> #4 is kind of good and bad -- it does mean that DAGMan should fail if
> you have log files on NFS.  But instead of getting the failure right
> at the beginning when it checks the log files, you'll get the failure
> when DAGMan tries to submit the offending job.  The other problem is that
> if there's some *other* warning from condor_submit that we don't want to
> be fatal, it will still goof things up.  We already have a PR (711) on
> this one.

I understand from a follow up email that this is not the case and that
it is being fixed to do the "right thing".

> 
> #5: this should be fixed within a day or two.

Fantastic.

> 
> Let me know if you disagree with any of the above...
> 

I think we are in complete agreement. My only remaining question is the
schedule for getting these fixes and changes released, e.g., will they
all be able to make 6.8.1?

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Wed Jul 26 23:59:51 2006 (1153976392)
Date: Fri, 28 Jul 2006 10:11:48 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1658] LIGO problem with condor_run 6.8.0

Stuart,

> Sorry to be pedantic, but are you saying you have seen problems with
> non-dagman Condor jobs that write their logs to NFS mounts even when
> there are _no_ processes trying to read the log file?

I'm not sure anyone has seen this, but there's no reason it couldn't
happen.  It's probably more likely with DAGMan, because if you have a
bunch of jobs sharing the same log file, there's more chance of multiple
processes trying to write to the log at the same time.

Mainly, though, you're less likely to *notice* a corrupted log file if
you're not running DAGMan.  There's actually nothing special about the log
file writing inside DAGMan, though -- you could have a whole bunch of
non-DAGMan jobs sharing a log file.  And the problem is possible even with
a single job, because there is more than one process writing to the log
file.

> > > At any rate, the conversion of all WARNING messages to abort(...write permission)
> > > should probably be fixed.
> >
> > Okay, it sounds like there are several issues here:
> >
> > 1. Should a plain condor_submit generate this warning?
> > 2. If condor_submit does generate a warning, condor_run should recognize
> >    that it's a warning, not a fatal error.
> > 3. If condor_submit does generate a warning, condor_run should pass along
> >    the warning message instead of printing a misleading message.
> > 4. Such a warning from condor_submit will goof up DAGMan.
> > 5. The check is not yet in DAGMan itself.
> >
> > #1: I think it *does* make sense for condor_submit to generate the
> > warning. I tend to think that, for a plain condor_submit, it shouldn't be
> > a fatal error, though (assuming we have DAGMan generate a fatal error in
> > this case).
>
> I agree on both counts. Is the DAGMan work to make this optionally a
> fatal error going to be allowed in the stable 6.8.x branch?

Yes.  I think we will default to this being a fatal error inside DAGMan,
with an option to downgrade it to a warning.

> > At this point, there's no way to turn off the warning from condor_submit,
> > which is probably not good.
>
> If there are NFS client/server implementations out there that have been
> demonstrated to robustly perform the file locking that Condor needs then
> I agree there should be another knob to disable this warning.
>
> >
> > #2 and #3: Condor_run obviously has to be changed to deal with warnings
> > from condor_submit better.
>
> Agreed.

One question about this -- what are you using condor_run for?  I was
discussing this issue with Peter Couvares, and he doesn't think that
condor_run should be used for anything besides a quick manual run
from the command line (e.g., don't build any other scripts on top of
condor_run).

> > #4 is kind of good and bad -- it does mean that DAGMan should fail if
> > you have log files on NFS.  But instead of getting the failure right
> > at the beginning when it checks the log files, you'll get the failure
> > when DAGMan tries to submit the offending job.  The other problem is that
> > if there's some *other* warning from condor_submit that we don't want to
> > be fatal, it will still goof things up.  We already have a PR (711) on
> > this one.
>
> I understand from a follow up email that this is not the case and that
> it is being fixed to do the "right thing".

Well, we didn't want to get into having DAGMan try to parse the
condor_submit warnings.  DAGMan is going to have its own check for log
files on NFS, and if there is one, it will bail out right at the
beginning, even if that job isn't until near the end of the DAG.

> > #5: this should be fixed within a day or two.
>
> Fantastic.
>
> >
> > Let me know if you disagree with any of the above...
> >
>
> I think we are in complete agreement. My only remaining question is the
> schedule for getting these fixes and changes released, e.g., will they
> all be able to make 6.8.1?

Well, I don't know when a full 6.8.1 will be ready (at least some weeks),
but I can get you a pre-release 6.8.1 condor_dagman as soon as the bug
fixes are ready.

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Fri Jul 28 10:14:44 2006 (1154099685)
CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>,         Erik Espinoza
 <espinoza_e__AT__ligo.caltech.edu>
From: Duncan Brown <dbrown__AT__ligo.caltech.edu>
Subject: Re: [condor-support #1658] LIGO problem with condor_run 6.8.0
Date: Fri, 28 Jul 2006 10:23:28 -0700
To: condor-support__AT__cs.wisc.edu

On Jul 28, 2006, at 9:14 AM, condor-support response tracking system  
wrote:
> One question about this -- what are you using condor_run for?  I was
> discussing this issue with Peter Couvares, and he doesn't think that
> condor_run should be used for anything besides a quick manual run
> from the command line (e.g., don't build any other scripts on top of
> condor_run).

This is basically what it is being used for. We have a matlab based  
(hence vanilla) quick look tool that takes a GPS time argument and  
makes various spectrograms of our data. A user typically types

condor_run Qscan 875756796.0

and it runs the job on the cluster while the user waits for the results.

Cheers,
Duncan.

-- 

Duncan Brown                          California Institute of Technology
Tapir: (626) 395 8409                 MS 18-34, Pasadena, CA 91125, USA
LIGO:  (626) 395 8812                 http://www.lsc-group.phys.uwm.edu/~duncan



===========================================================================
Date mail was appended: Fri Jul 28 12:31:04 2006 (1154107865)
Date: Fri, 28 Jul 2006 08:25:55 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza_e__AT__ligo.caltech.edu, duncan__AT__gravity.phys.uwm.edu
Subject: Re: [condor-support #1658] LIGO problem with condor_run 6.8.0

On Fri, Jul 28, 2006 at 10:14:44AM -0600, condor-support response tracking system wrote:
> 
> > > At this point, there's no way to turn off the warning from condor_submit,
> > > which is probably not good.
> >
> > If there are NFS client/server implementations out there that have been
> > demonstrated to robustly perform the file locking that Condor needs then
> > I agree there should be another knob to disable this warning.
> >
> > >
> > > #2 and #3: Condor_run obviously has to be changed to deal with warnings
> > > from condor_submit better.
> >
> > Agreed.
> 
> One question about this -- what are you using condor_run for?  I was
> discussing this issue with Peter Couvares, and he doesn't think that
> condor_run should be used for anything besides a quick manual run
> from the command line (e.g., don't build any other scripts on top of
> condor_run).

I don't know, but a few users noticed this problem right away and
then I reproduced the problem running a trivial test job.

What is the concern and what is the recommendation? Should custom scripts
be written on top of condor_submit instead? If so, what reason should
I give our users?

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Fri Jul 28 12:31:31 2006 (1154107891)
CC: condor-support response tracking system <condor-support__AT__cs.wisc.edu>,
 Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>
From: Duncan Brown <dbrown__AT__ligo.caltech.edu>
Subject: Re: [condor-support #1658] LIGO problem with condor_run 6.8.0
Date: Fri, 28 Jul 2006 10:40:05 -0700
To: Stuart Anderson <anderson__AT__ligo.caltech.edu>

Hi Stuart,

On Jul 28, 2006, at 8:25 AM, Stuart Anderson wrote:
>> One question about this -- what are you using condor_run for?  I was
>> discussing this issue with Peter Couvares, and he doesn't think that
>> condor_run should be used for anything besides a quick manual run
>> from the command line (e.g., don't build any other scripts on top of
>> condor_run).
>
> I don't know, but a few users noticed this problem right away and
> then I reproduced the problem running a trivial test job.
>
> What is the concern and what is the recommendation? Should custom
> scripts
> be written on top of condor_submit instead? If so, what reason should
> I give our users?

I actually think we should be encouraging users to use condor_run,  
e.g. for launching lalapps_coire post-processing type programs that  
run happily in the vanilla universe, generally take less than 4  
hours, and clog up pcdev*

Cheers,
Duncan.

-- 

Duncan Brown                          California Institute of Technology
Tapir: (626) 395 8409                 MS 18-34, Pasadena, CA 91125, USA
LIGO:  (626) 395 8812                 http://www.lsc-group.phys.uwm.edu/~duncan



===========================================================================
Date mail was appended: Fri Jul 28 12:47:44 2006 (1154108864)
Date: Fri, 28 Jul 2006 13:27:59 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1658] LIGO problem with condor_run 6.8.0

Stuart,

> I don't know, but a few users noticed this problem right away and
> then I reproduced the problem running a trivial test job.
>
> What is the concern and what is the recommendation? Should custom scripts
> be written on top of condor_submit instead? If so, what reason should
> I give our users?

Here's what Peter has to say about this:

  Because it's not a reliable way to block for the completion
  of a job.  If condor_run goes away mid-job (e.g., because it's
  interrupted or killed, or the submit machine goes down), you no
  longer have a handle on the job.

  It simply doesn't handle failures.  It's meant as a convenience tool
  for interactive, command-line operations, but it should never be used
  as part of a script or automated system, IMHO, because it will only
  work so long as nothing goes wrong.  And automated systems need to be
  able to handle things going wrong.

One alternative is to simply use DAGMan if you want to do anything
more complex than a quick command-line test.

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Fri Jul 28 13:29:42 2006 (1154111383)
Date: Mon, 11 Sep 2006 17:35:17 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>,
 tannenba__AT__cs.wisc.edu
CC: espinoza_e__AT__ligo.caltech.edu, duncan__AT__gravity.phys.uwm.edu,
 dbrown__AT__ligo.caltech.edu
Subject: Re: [condor-support #1658] LIGO problem with condor_run 6.8.0

In the pre-release version of 6.8.1 running at CIT condor_run on an
NFS mounted directory no longer generates a warning message. Was this
6.8.0 enhancement removed?

On Fri, Jul 28, 2006 at 01:29:42PM -0600, condor-support response tracking system wrote:
> Stuart,
> 
> > I don't know, but a few users noticed this problem right away and
> > then I reproduced the problem running a trivial test job.
> >
> > What is the concern and what is the recommendation? Should custom scripts
> > be written on top of condor_submit instead? If so, what reason should
> > I give our users?
> 
> Here's what Peter has to say about this:
> 
>   Because it's not a reliable way to block for the completion
>   of a job.  If condor_run goes away mid-job (e.g., because it's
>   interrupted or killed, or the submit machine goes down), you no
>   longer have a handle on the job.
> 
>   It simply doesn't handle failures.  It's meant as a convenience tool
>   for interactive, command-line operations, but it should never be used
>   as part of a script or automated system, IMHO, because it will only
>   work so long as nothing goes wrong.  And automated systems need to be
>   able to handle things going wrong.
> 
> One alternative is to simply use DAGMan if you want to do anything
> more complex than a quick command-line test.
> 
> Kent Wenger
> Condor Team

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Mon Sep 11 19:38:05 2006 (1158021487)
Date: Thu, 14 Sep 2006 12:56:21 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1658] LIGO problem with condor_run 6.8.0

Stuart,

> In the pre-release version of 6.8.1 running at CIT condor_run on an
> NFS mounted directory no longer generates a warning message. Was this
> 6.8.0 enhancement removed?

Hmm, that's kind of strange.  The check was only ever in various 6.8.1
pre-release versions, not 6.8.0.  We changed how strict the checking is
at some point after it was initially implemented, but if your log file
is on NFS, you should at least get a warning (you can configure whether
or not it's a fatal error).

By default, having the log files on NFS is a fatal error for DAGMan, but
not for a "plain" condor_submit.

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Thu Sep 14 12:57:34 2006 (1158256655)
Date: Thu, 14 Sep 2006 17:16:38 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1658] LIGO problem with condor_run 6.8.0

Stuart,

> In the pre-release version of 6.8.1 running at CIT condor_run on an
> NFS mounted directory no longer generates a warning message. Was this
> 6.8.0 enhancement removed?

Hmm, it looks like I was wrong -- the first version of the NFS checking
*was* in 6.8.0 (only in condor_submit, not in condor_dagman), just not in
the documentation, unfortunately.

However, I was right in that the checking was changed somewhat between
6.8.0 and the 6.8.1 pre-release that you have -- there's now separate
configuration for condor_submit and DAGMan as to whether having a log
file on NFS is a fatal error.

You should at least get a warning from condor_submit if the log file is on
NFS, though...

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Thu Sep 14 17:17:56 2006 (1158272277)
Date: Thu, 14 Sep 2006 17:21:38 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1658] LIGO problem with condor_run 6.8.0

On Thu, Sep 14, 2006 at 05:17:56PM -0600, condor-support response tracking system wrote:
> Stuart,
> 
> > In the pre-release version of 6.8.1 running at CIT condor_run on an
> > NFS mounted directory no longer generates a warning message. Was this
> > 6.8.0 enhancement removed?
> 
> Hmm, it looks like I was wrong -- the first version of the NFS checking
> *was* in 6.8.0 (only in condor_submit, not in condor_dagman), just not in
> the documentation, unfortunately.
> 
> However, I was right in that the checking was changed somewhat between
> 6.8.0 and the 6.8.1 pre-release that you have -- there's now separate
> configuration for condor_submit and DAGMan as to whether having a log
> file on NFS is a fatal error.

What are the configuration knobs in 6.8.1pre?

> 
> You should at least get a warning from condor_submit if the log file is on
> NFS, though...

I just verified that 6.8.1pre "condor_run /bin/hostname" run from an NFS
directory without adjusting any new configuration knobs does not generate
any warning or error. In 6.8.0 this generated an incorrect error message,
see the initial [condor-support #1658] email.

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Thu Sep 14 19:22:06 2006 (1158279727)
Date: Mon, 18 Sep 2006 15:33:59 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1658] LIGO problem with condor_run 6.8.0

Stuart,

> What are the configuration knobs in 6.8.1pre?

Sorry -- I keep kind of forgetting that the 6.8.1 manual isn't available
on the web yet...

DAGMAN_LOG_ON_NFS_IS_ERROR
    A boolean value that controls whether condor_dagman prohibits node
job submit files with user log files on NFS. If a DAG references such a
submit file and DAGMAN_LOG_ON_NFS_IS_ERROR is True, the DAG will abort
during the initialization process. If DAGMAN_LOG_ON_NFS_IS_ERROR is False,
a warning will be issued but the DAG will still be submitted. It is
strongly recommended that DAGMAN_LOG_ON_NFS_IS_ERROR remain set to the
default value, because running a DAG with node job log files on NFS will
often cause errors. If not defined, DAGMAN_LOG_ON_NFS_IS_ERROR defaults to
True.

LOG_ON_NFS_IS_ERROR
    A boolean value that controls whether condor_submit prohibits job
submit files with user log files on NFS. If LOG_ON_NFS_IS_ERROR is set to
True, such submit files will be rejected. If LOG_ON_NFS_IS_ERROR is set to
False, submitting such a file results in a warning, but the job will be
submitted. If not defined, LOG_ON_NFS_IS_ERROR defaults to False.
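
As a sketch, the two defaults described above would look like this if written
out explicitly in a Condor configuration file. Setting them explicitly is an
assumption made for illustration; both values simply restate the documented
defaults, so leaving the knobs undefined should behave the same way.

```
# condor_submit: warn about, but still accept, a user log file on NFS.
LOG_ON_NFS_IS_ERROR = False

# condor_dagman: abort during DAG initialization if a node job's
# user log file is on NFS.
DAGMAN_LOG_ON_NFS_IS_ERROR = True
```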

> > You should at least get a warning from condor_submit if the log file is on
> > NFS, though...
>
> I just verified that 6.8.1pre "condor_run /bin/hostname" run from an NFS
> directory without adjusting any new configuration knobs does not generate
> any warning or error. In 6.8.0 this generated an incorrect error message,
> see the initial [condor-support #1658] email.

What happens if you just do a plain "condor_submit" with the log file on
NFS?  That should produce a warning in 6.8.1pre.

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Mon Sep 18 15:35:07 2006 (1158611707)
Date: Mon, 18 Sep 2006 13:51:33 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza_e__AT__ligo.caltech.edu, duncan__AT__gravity.phys.uwm.edu,
 dbrown__AT__ligo.caltech.edu
Subject: Re: [condor-support #1658] LIGO problem with condor_run 6.8.0

On Mon, Sep 18, 2006 at 03:35:07PM -0600, condor-support response tracking system wrote:
> Stuart,
> 
> > What are the configuration knobs in 6.8.1pre?
> 
> Sorry -- I keep kind of forgetting that the 6.8.1 manual isn't available
> on the web yet...
> 
> DAGMAN_LOG_ON_NFS_IS_ERROR
>     A boolean value that controls whether condor_dagman prohibits node
> job submit files with user log files on NFS. If a DAG references such a
> submit file and DAGMAN_LOG_ON_NFS_IS_ERROR is True, the DAG will abort
> during the initialization process. If DAGMAN_LOG_ON_NFS_IS_ERROR is False,
> a warning will be issued but the DAG will still be submitted. It is
> strongly recommended that DAGMAN_LOG_ON_NFS_IS_ERROR remain set to the
> default value, because running a DAG with node job log files on NFS will
> often cause errors. If not defined, DAGMAN_LOG_ON_NFS_IS_ERROR defaults to
> True.
> 
> LOG_ON_NFS_IS_ERROR
>     A boolean value that controls whether condor_submit prohibits job
> submit files with user log files on NFS. If LOG_ON_NFS_IS_ERROR is set to
> True, such submit files will be rejected. If LOG_ON_NFS_IS_ERROR is set to
> False, submitting such a file results in a warning, but the job will be
> submitted. If not defined, LOG_ON_NFS_IS_ERROR defaults to False.
> 
> > > You should at least get a warning from condor_submit if the log file is on
> > > NFS, though...
> >
> > I just verified that 6.8.1pre "condor_run /bin/hostname" run from an NFS
> > directory without adjusting any new configuration knobs does not generate
> > any warning or error. In 6.8.0 this generated an incorrect error message,
> > see the initial [condor-support #1658] email.
> 
> What happens if you just do a plain "condor_submit" with the log file on
> NFS?  That should produce a warning in 6.8.1pre.

I do not get a warning with the default settings of the two new knobs
listed above. The job simply runs to completion, e.g.,

000 (7376028.000.000) 09/18 13:47:33 Job submitted from host: <10.14.0.12:42608>
...
001 (7376028.000.000) 09/18 13:47:54 Job executing on host: <10.14.1.108:50953>
...
005 (7376028.000.000) 09/18 13:47:54 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job
...

and the corresponding error file is empty.

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Mon Sep 18 15:51:53 2006 (1158612713)
Date: Fri, 29 Sep 2006 10:33:25 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1658] LIGO problem with condor_run 6.8.0

Stuart,

> I do not get a warning with the default settings of the two new knobs
> listed above. The job simply runs to completion, e.g.,

Hmm, I'm looking into this on our end, and the NFS checks are working
okay in 6.8.1.

I have some special test versions of condor_submit that have extra debug
output:

ftp://ftp.cs.wisc.edu/condor/temporary/forligo/ia64_rhas_3/condor_submit_test
ftp://ftp.cs.wisc.edu/condor/temporary/forligo/x86_rhas_3/condor_submit_test

For some reason my rh_9 build of this didn't work right.

Anyhow, please run one of these test binaries on a submit file that has
a log file on NFS, and send me the output.  That should help us in
diagnosing what is going wrong.

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Fri Sep 29 10:34:05 2006 (1159544045)
Date: Fri, 29 Sep 2006 10:34:37 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza_e__AT__ligo.caltech.edu, duncan__AT__gravity.phys.uwm.edu,
 dbrown__AT__ligo.caltech.edu
Subject: Re: [condor-support #1658] LIGO problem with condor_run 6.8.0

On Fri, Sep 29, 2006 at 10:34:05AM -0600, condor-support response tracking system wrote:
> Stuart,
> 
> > I do not get a warning with the default settings of the two new knobs
> > listed above. The job simply runs to completion, e.g.,
> 
> Hmm, I'm looking into this on our end, and the NFS checks are working
> okay in 6.8.1.

This appears to have been a problem with just the 6.8.1 pre-release we
were running. Now that we have upgraded to the actual 6.8.1 release,
we get the following expected behavior from condor_submit with the default
setting of LOG_ON_NFS_IS_ERROR:

$ condor_submit .condor_submit.16358
Submitting job(s)
WARNING: Log file /archive/home/anderson/.condor_log.16358 is on NFS.
This could cause log file corruption and is _not_ recommended.
.
Logging submit event(s).
1 job(s) submitted to cluster 7756305.


However, the 6.8.1 release version of condor_run still does not properly
catch this error and pass it on to the user; instead it generates:

Condor does not have write permission to this directory.

This is due to the incorrect assumption in condor_run that any
WARNING message from condor_submit has to do with "write permission", i.e.,
the following loop in condor_run is incorrect:

while(<SUBMIT>) {
    if (/^1 job\(s\) submitted to cluster (\d+)./) {
        ($cluster) = $1;
    } elsif (/WARNING/) {
        &abort("Condor does not have write permission to this directory.\n");
    } else {
        $submit_errors .= $_;
    }
}
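
A minimal sketch of one way the loop could relay warnings instead of
aborting on them. This is only an illustration, not the actual fix:
parse_submit_output is a hypothetical helper name that does not exist in
condor_run, and the sample lines are copied from the condor_submit output
shown above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of condor_run): sort condor_submit output
# lines into the cluster number, WARNING lines, and everything else,
# rather than treating any WARNING as a fatal write-permission error.
sub parse_submit_output {
    my (@lines) = @_;
    my ($cluster, @warnings, @other);
    for my $line (@lines) {
        if ($line =~ /^1 job\(s\) submitted to cluster (\d+)\./) {
            $cluster = $1;
        } elsif ($line =~ /WARNING/) {
            push @warnings, $line;
        } else {
            push @other, $line;
        }
    }
    return ($cluster, \@warnings, \@other);
}

# Sample input taken from the 6.8.1 condor_submit output quoted above.
my @sample = (
    "Submitting job(s)\n",
    "WARNING: Log file /archive/home/anderson/.condor_log.16358 is on NFS.\n",
    "This could cause log file corruption and is _not_ recommended.\n",
    ".\n",
    "Logging submit event(s).\n",
    "1 job(s) submitted to cluster 7756305.\n",
);

my ($cluster, $warnings, $other) = parse_submit_output(@sample);

# Pass warnings through to the user verbatim instead of rewording them.
print STDERR @$warnings;

# Only treat the submission as failed if no cluster number was reported.
if (defined $cluster) {
    print "cluster $cluster\n";
} else {
    die "condor_submit failed:\n", @$other;
}
```

With this shape, a warning alone never aborts the run; failure is decided
by whether condor_submit reported a cluster number.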


Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Fri Sep 29 12:34:56 2006 (1159551296)
Date: Fri, 29 Sep 2006 12:57:18 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1658] LIGO problem with condor_run 6.8.0

> This appears to have been a problem with just the 6.8.1 pre-release we
> were running. Now that we have upgraded to the actual 6.8.1 release,
> we get the following expected behavior from condor_submit with the default
> setting of LOG_ON_NFS_IS_ERROR:
>
> $ condor_submit .condor_submit.16358
> Submitting job(s)
> WARNING: Log file /archive/home/anderson/.condor_log.16358 is on NFS.
> This could cause log file corruption and is _not_ recommended.

Whew!  Okay, I think maybe there was a period of a few days during 6.8.1
development where the checking was turned off while bugs were fixed --
maybe you got that version.

As long as it keeps working, don't bother with the special
condor_submit_test binaries I made.

The NFS checking should be the same as 6.8.1 in the 6.8.2 pre-release
binaries.

It sounds like this ticket is done, then, except for condor_run
tolerating the NFS warnings.

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Fri Sep 29 12:59:04 2006 (1159552744)
Date: Fri, 29 Sep 2006 11:07:24 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza_e__AT__ligo.caltech.edu, duncan__AT__gravity.phys.uwm.edu,
 dbrown__AT__ligo.caltech.edu
Subject: Re: [condor-support #1658] LIGO problem with condor_run 6.8.0

On Fri, Sep 29, 2006 at 12:59:04PM -0600, condor-support response tracking system wrote:
> > This appears to have been a problem with just the 6.8.1 pre-release we
> > were running. Now that we have upgraded to the actual 6.8.1 release,
> > we get the following expected behavior from condor_submit with the default
> > setting of LOG_ON_NFS_IS_ERROR:
> >
> > $ condor_submit .condor_submit.16358
> > Submitting job(s)
> > WARNING: Log file /archive/home/anderson/.condor_log.16358 is on NFS.
> > This could cause log file corruption and is _not_ recommended.
> 
> Whew!  Okay, I think maybe there was a period of a few days during 6.8.1
> development where the checking was turned off while bugs were fixed --
> maybe you got that version.
> 
> As long as it keeps working, don't bother with the special
> condor_submit_test binaries I made.

OK.

> 
> The NFS checking should be the same as 6.8.1 in the 6.8.2 pre-release
> binaries.
> 
> It sounds like this ticket is done, then, except for condor_run
> tolerating the NFS warnings.

Yes, though rather than "tolerating" I would say "mangling the error message".

Perhaps "WARNING" should also be changed to "ERROR" when LOG_ON_NFS_IS_ERROR
is set to True (e.g., by default).

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Fri Sep 29 13:07:49 2006 (1159553269)
Date: Fri, 3 Nov 2006 09:50:36 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza_e__AT__ligo.caltech.edu, duncan__AT__gravity.phys.uwm.edu,
 dbrown__AT__ligo.caltech.edu
Subject: Re: [condor-support #1658] LIGO problem with condor_run 6.8.0

This was solved in a 6.8.3 pre-release under [condor-admin #14270].

Please close this ticket.

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Fri Nov  3 11:51:00 2006 (1162576260)
Subject: Actions

Ticket resolved by wenger
===========================================================================
Date of actions: Wed Nov 15 15:05:31 2006 (1163624732)
Subject: Actions

Ticket was reopened by mailnull
===========================================================================
Date of actions: Wed Nov 15 15:09:20 2006 (1163624960)
Date: Wed, 15 Nov 2006 15:06:14 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1658] LIGO problem with condor_run 6.8.0

Stuart,

> This was solved in a 6.8.3 pre-release under [condor-admin #14270].
>
> Please close this ticket.

Great!  I'm glad to hear things got fixed.

I just got back from my extended vacation on Monday, and I'm still
catching up on email...

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Wed Nov 15 15:09:20 2006 (1163624960)
Subject: Actions

Ticket resolved by wenger
===========================================================================
Date of actions: Wed Nov 22 11:52:15 2006 (1164217935)