LIGO Support Ticket 19807

Ticket Information
  Number:      admin 19807
  User:        anderson@ligo.caltech.edu
  Email:       
  Status:      open
  Assigned To: gthain
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
Subject: LIGO: error code for submitting non-std-universe jobs
Date: Fri, 9 Oct 2009 15:05:26 -0700
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu

What do you think about enhancing the way Condor handles the error  
condition where a user submits a non-standard-universe job to the  
standard universe?

Currently, Condor 7.4.0 does not detect this until it gets to the  
Starter, and then a specific error message does not propagate back  
up, e.g., the Shadow logs the generic 102 "JOB_KILLED" message.

One possible enhancement would be to have an explicit Shadow exit  
status for this.

More generally, is there a convenient condor-way to catch this type of  
error sooner and get the error message back to a user that might not  
have access to the Starter logs?

Thanks.

--
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson




===========================================================================
Date of creation: Fri Oct  9 17:05:42 2009 (1255125945)
Subject: Actions

Assigned to gthain by gthain
===========================================================================
Date of actions: Mon Oct 12 11:39:47 2009 (1255365587)
Date: Mon, 12 Oct 2009 11:57:08 -0500
From: Greg Thain <gthain__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19807] LIGO: error code for submitting
 non-std-universe jobs


>
> What do you think about enhancing the way Condor handles the error  
> condition where a user submits a non-standard-universe job to the  
> standard universe?
>
>   
Stuart:

I think this is a great area for improvement in Condor.  What would you 
think about a check at condor_submit time, and have condor_submit issue a 
warning?  The only tricky thing is we can't guarantee that condor_submit 
can find the executable to check on it.  This is because the executable 
can have macros in it like $$(ARCH).

-Greg
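
The submit-time check proposed above could be sketched roughly as follows.
This is only an illustration, not how condor_submit works: parse_submit,
precheck, the read_bytes hook, and the marker byte string are all
hypothetical names, and the idea that a relinked binary can be recognized
by a byte marker is an assumption. The sketch returns "unchecked" when
the executable path contains a macro such as $$(ARCH), since the path
cannot be resolved at submit time.

```python
def parse_submit(text):
    """Collect simple 'key = value' commands from a submit description."""
    commands = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        commands[key.strip().lower()] = value.strip()
    return commands


def precheck(commands, read_bytes, marker):
    """Return 'ok', 'warn', or 'unchecked' for one submit description.

    read_bytes: callable taking a path and returning the file contents
                (hypothetical hook, so the check can be tested in isolation).
    marker:     byte string ASSUMED to appear only in binaries relinked for
                the standard universe (placeholder, not a real Condor value).
    """
    if commands.get("universe", "").lower() != "standard":
        return "ok"  # nothing to verify for other universes
    exe = commands.get("executable", "")
    if not exe or "$(" in exe:
        # "$(" also matches "$$(ARCH)": the path is not known at submit time
        return "unchecked"
    try:
        data = read_bytes(exe)
    except OSError:
        return "unchecked"  # executable not readable from the submit host
    return "ok" if marker in data else "warn"
```

A wrapper script could print a warning for "warn" and stay silent for
"unchecked", leaving those cases to a later check further down the chain.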



===========================================================================
Date mail was appended: Mon Oct 12 11:57:14 2009 (1255366634)
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19807] LIGO: error code for submitting
 non-std-universe jobs
Date: Mon, 12 Oct 2009 10:16:03 -0700
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu


On Oct 12, 2009, at 9:57 AM, condor-admin response tracking system  
wrote:

>
>>
>> What do you think about enhancing the way Condor handles the error
>> condition where a user submits a non-standard-universe job to the
>> standard universe?
>>
>>
> Stuart:
>
> I think this is a great area for improvement in Condor.  What would
> you think about a check at condor_submit time, and have condor_submit
> issue a warning?  The only tricky thing is we can't guarantee that
> condor_submit can find the executable to check on it.  This is
> because the executable can have macros in it like $$(ARCH).

A check at submit time would be very helpful, and if that does not
catch all the conditions then perhaps another check further down the
call chain would be helpful in addition.

Note, the most common case of this for LIGO is with condor_submit_dag
so handling that gracefully and informatively for the end user would be
very helpful.

Thanks.


--
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
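
For the condor_submit_dag case mentioned above, one way to apply a
pre-submission check is to first collect the submit description files
that a DAG references. The sketch below is a minimal illustration under
stated assumptions: dag_submit_files is a hypothetical helper, it only
looks at "JOB <name> <submit-file>" lines, it matches the keyword
case-insensitively as an assumption, and it ignores any DIR option.

```python
def dag_submit_files(dag_text):
    """Return the submit files named on JOB lines, in order, without repeats."""
    files = []
    for line in dag_text.splitlines():
        tokens = line.split()
        # Expected shape: JOB <node-name> <submit-file> [options...]
        if len(tokens) >= 3 and tokens[0].upper() == "JOB":
            submit_file = tokens[2]
            if submit_file not in files:
                files.append(submit_file)
    return files
```

Because large DAGs often reuse a handful of submit files across many
nodes, checking each distinct file once keeps the cost small even for a
DAG with a very large number of jobs.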




===========================================================================
Date mail was appended: Mon Oct 12 12:16:15 2009 (1255367776)
Date: Tue, 13 Oct 2009 10:34:47 -0500
From: Greg Thain <gthain__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19807] LIGO: error code for submitting
 non-std-universe jobs


>
> Note, the most common case of this for LIGO is with condor_submit_dag
> so handling that gracefully and informatively for the end user would be
> very helpful.
>
>   
Stuart:

Perhaps you already know this, but if you set

notification = error

then the user will be sent an email with a pretty clear error message in 
this case.

-Greg


===========================================================================
Date mail was appended: Tue Oct 13 10:34:52 2009 (1255448093)
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19807] LIGO: error code for submitting
 non-std-universe jobs
Date: Tue, 13 Oct 2009 09:50:21 -0700
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu


On Oct 13, 2009, at 8:34 AM, condor-admin response tracking system  
wrote:

>
>>
>> Note, the most common case of this for LIGO is with condor_submit_dag
>> so handling that gracefully and informatively for the end user would
>> be very helpful.
>>
>>
> Stuart:
>
> Perhaps you already know this, but if you set
>
> notification = error
>
> then the user will be sent an email with a pretty clear error message
> in this case.

Yes, we have this setting in SUBMIT_EXPRS as a default for users.
However, some of our users set notification to never before submitting
a DAG with 100,000+ jobs in it for somewhat obvious reasons.

Thanks.

--
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson




===========================================================================
Date mail was appended: Tue Oct 13 11:50:33 2009 (1255452633)