LIGO Support Ticket 19807
Ticket Information
Number: admin 19807
User: anderson@ligo.caltech.edu
Email:
Status: open
Assigned To: gthain
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
Subject: LIGO: error code for submitting non-std-universe jobs
Date: Fri, 9 Oct 2009 15:05:26 -0700
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu
What do you think about enhancing the way Condor handles the error
condition where a user submits a non-standard-universe job to the
standard universe?
Currently, Condor 7.4.0 does not detect this until the job reaches the
Starter, and the specific error message then does not propagate back
up, e.g., the Shadow logs only the generic 102 "JOB_KILLED" message.
One possible enhancement would be to have an explicit Shadow exit
status for this.
More generally, is there a convenient condor-way to catch this type of
error sooner and get the error message back to a user that might not
have access to the Starter logs?
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date of creation: Fri Oct 9 17:05:42 2009 (1255125945)
Subject: Actions
Assigned to gthain by gthain
===========================================================================
Date of actions: Mon Oct 12 11:39:47 2009 (1255365587)
Date: Mon, 12 Oct 2009 11:57:08 -0500
From: Greg Thain <gthain__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19807] LIGO: error code for submitting non-std-universe jobs
>
> What do you think about enhancing the way Condor handles the error
> condition where a user submits a non-standard-universe job to the
> standard universe?
>
>
Stuart:
I think this is a great area for improvement in Condor. What would you
think about a check at condor_submit time, with condor_submit issuing a
warning? The only tricky thing is that we can't guarantee that
condor_submit can find the executable to check it, because the
executable path can contain macros like $$(ARCH).
-Greg
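
A minimal sketch of the submit-time check proposed above, written in Python for illustration only and not taken from condor_submit itself. The function names and the marker byte string are hypothetical; the thread does not say what signature a standard-universe binary actually carries. The sketch skips the check when the path contains a $$() macro, since the path cannot be resolved at submit time.

```python
import os
import re

# ASSUMPTION: placeholder for whatever signature a binary relinked for the
# standard universe carries; the real marker is not given in this thread.
HYPOTHETICAL_MARKER = b"CONDOR_RELINKED_MARKER"

# A path containing $$(...) cannot be resolved when the job is submitted.
MATCH_TIME_MACRO = re.compile(r"\$\$\(")


def contains_marker(data: bytes, marker: bytes = HYPOTHETICAL_MARKER) -> bool:
    """Return True if the binary contents include the marker."""
    return marker in data


def precheck_executable(path: str, marker: bytes = HYPOTHETICAL_MARKER) -> str:
    """Classify the executable named in a standard-universe submit file.

    Returns one of:
      "skipped"   - path contains a $$() macro, so nothing can be checked
      "not-found" - no regular file at that path on the submit machine
      "ok"        - marker found in the file
      "warn"      - file readable but marker absent; warn the user
    """
    if MATCH_TIME_MACRO.search(path):
        return "skipped"
    if not os.path.isfile(path):
        return "not-found"
    with open(path, "rb") as handle:
        return "ok" if contains_marker(handle.read(), marker) else "warn"
```

A "warn" result would correspond to the warning suggested above; "skipped" and "not-found" are the cases where a later check further down the chain would still be needed.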
===========================================================================
Date mail was appended: Mon Oct 12 11:57:14 2009 (1255366634)
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19807] LIGO: error code for submitting non-std-universe jobs
Date: Mon, 12 Oct 2009 10:16:03 -0700
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu
On Oct 12, 2009, at 9:57 AM, condor-admin response tracking system
wrote:
>
>>
>> What do you think about enhancing the way Condor handles the error
>> condition where a user submits a non-standard-universe job to the
>> standard universe?
>>
>>
> Stuart:
>
> I think this is a great area for improvement in Condor. What would
> you think about a check at condor_submit time, and have condor_submit
> issue a warning? The only tricky thing is we can't guarantee that
> condor_submit can find the executable to check on it. This is because
> the executable can have macros in it like $$(ARCH).
A check at submit time would be very helpful, and if that does not
catch all the conditions then perhaps another check further down the
call chain would be helpful in addition.
Note, the most common case of this for LIGO is with condor_submit_dag
so handling that gracefully and informatively for the end user would be
very helpful.
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
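
For the condor_submit_dag case mentioned above, one possible pre-flight step is to list which DAG nodes request the standard universe before the DAG is submitted, so their executables can be checked in one pass. The following Python sketch is an illustration under stated assumptions, not part of Condor: the function names are hypothetical, only plain JOB lines and simple "key = value" submit commands are parsed, and a submit file is flagged only when it names the standard universe explicitly (any pool-wide default universe is ignored).

```python
def dag_submit_files(dag_text: str) -> dict:
    """Map each DAG node name to the submit file named on its JOB line."""
    nodes = {}
    for line in dag_text.splitlines():
        words = line.split()
        # JOB <node-name> <submit-file> [options...]; comments start with '#'
        if len(words) >= 3 and words[0].upper() == "JOB":
            nodes[words[1]] = words[2]
    return nodes


def submit_settings(submit_text: str) -> dict:
    """Collect simple 'key = value' commands; keys are lowercased."""
    settings = {}
    for line in submit_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip().lower()] = value.strip()
    return settings


def standard_universe_nodes(dag_text: str, read_submit_file) -> list:
    """Return (node, executable) pairs for nodes that name universe = standard.

    read_submit_file is a caller-supplied function (hypothetical) that
    returns the text of a submit file given its name.
    """
    flagged = []
    for node, submit_file in dag_submit_files(dag_text).items():
        settings = submit_settings(read_submit_file(submit_file))
        if settings.get("universe", "").lower() == "standard":
            flagged.append((node, settings.get("executable", "")))
    return flagged
```

Reporting the flagged nodes once, up front, would avoid relying on per-job notification email for a DAG with a very large number of jobs.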
===========================================================================
Date mail was appended: Mon Oct 12 12:16:15 2009 (1255367776)
Date: Tue, 13 Oct 2009 10:34:47 -0500
From: Greg Thain <gthain__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19807] LIGO: error code for submitting non-std-universe jobs
>
> Note, the most common case of this for LIGO is with condor_submit_dag
> so handling that gracefully and informatively for the end user would be
> very helpful.
>
>
Stuart:
Perhaps you already know this, but if you set
notification = error
then the user will be sent an email with a pretty clear error message in
this case.
-Greg
===========================================================================
Date mail was appended: Tue Oct 13 10:34:52 2009 (1255448093)
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19807] LIGO: error code for submitting non-std-universe jobs
Date: Tue, 13 Oct 2009 09:50:21 -0700
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu
On Oct 13, 2009, at 8:34 AM, condor-admin response tracking system
wrote:
>
>>
>> Note, the most common case of this for LIGO is with condor_submit_dag
>> so handling that gracefully and informatively for the end user would
>> be very helpful.
>>
>>
> Stuart:
>
> Perhaps you already know this, but if you set
>
> notification = error
>
> then the user will be sent an email with a pretty clear error message
> in this case.
Yes, we have this setting in SUBMIT_EXPRS as a default for users.
However, some of our users set notification to never before submitting
a DAG with 100,000+ jobs in it, for somewhat obvious reasons.
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Tue Oct 13 11:50:33 2009 (1255452633)