LIGO Support Ticket 17756

Ticket Information
  Number:      admin 17756
  User:        skoranda@gravity.phys.uwm.edu
  Email:       anderson__AT__ligo.caltech.edu,nvf__AT__gravity.phys.uwm.edu
  Status:      resolved
  Assigned To: bt
Date: Mon, 17 Mar 2008 10:46:47 -0500
From: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
To: condor-admin__AT__cs.wisc.edu
CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>,
 Nickolas Fotopoulos <nvf__AT__gravity.phys.uwm.edu>
Subject: LIGO Support Ticket 17748
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu

Hi,

Please append this to LIGO support Ticket 17748.

In short, some of the children in a rescue DAG had noop = True
added, and at some point after the rescue DAG was submitted the
schedd died.

Note, however, that this is with the following versions:

$ condor_version
$CondorVersion: 6.9.4 Aug 30 2007 $
$CondorPlatform: X86_64-LINUX_RHEL3 $
$ condor_dagman -version
$CondorVersion: 6.9.5 Nov 28 2007 BuildID: 65347 $
$CondorPlatform: X86_64-LINUX_RHEL3 $

If you think this will be solved by upgrading to 7.0.1 and the
latest DAGMan binaries Kent has produced, let us know and we
will simply do that before debugging this further.

Thanks,

Scott


----- Forwarded message from Nickolas Fotopoulos <nvf__AT__gravity.phys.uwm.edu> -----

From: Nickolas Fotopoulos <nvf__AT__gravity.phys.uwm.edu>
To: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
Subject: Re: [Condor] Problem marlin.phys.uwm.edu: condor_schedd exited (4)
Date: Sun, 16 Mar 2008 12:00:50 -0700

It is among the children, and it apparently was submitted, executed, and 
failed multiple times.  Every time I got the assertion error, the DAG would 
restart.  The same job got the same cluster number (1664477.0) repeatedly.  
This is very bizarre behavior.

The dagman.out is at 
/scratch2/nvf/GRB070714B_cbc_s5_1yr_20070129/GRB070714B/injections31/injections31.GRB070714B_injections31.dag.dagman.out 
if you would like to examine it directly.

Take care,
Nick

On Mar 16, 2008, at 5:48 AM, Scott Koranda wrote:

> Hi,
>
> The job mentioned below, 1664477.0, was one of yours.
>
> Was it one of the jobs in the rescue DAG where you added the
> noop? Or perhaps it was a parent?
>
> I will add this to the bug report but wanted to know how it
> was related...
>
> Thanks,
>
> Scott
>
>
> ----- Forwarded message from Condor User <condor__AT__gravity.phys.uwm.edu> -----
>
> Date: Sat, 15 Mar 2008 20:18:13 -0500
> From: Condor User <condor__AT__gravity.phys.uwm.edu>
> To: condor-admin__AT__gravity.phys.uwm.edu
> Subject: [Condor] Problem marlin.phys.uwm.edu: condor_schedd exited (4)
>
> This is an automated email from the Condor system
> on machine "marlin.phys.uwm.edu".  Do not reply.
>
> "/opt/condor/sbin/condor_schedd" on "marlin.phys.uwm.edu" exited with status 4.
> Condor will automatically restart this process in 10 seconds.
>
> *** Last 20 line(s) of file SchedLog:
> 3/15 20:17:21 (pid:2870) Calling Handler <DaemonCore::HandleReqSocketHandler>
> 3/15 20:17:21 (pid:2870) Calling HandleReq <attempt_access_handler> (0)
> 3/15 20:17:21 (pid:2870) Return from HandleReq <attempt_access_handler>
> 3/15 20:17:21 (pid:2870) Return from Handler <DaemonCore::HandleReqSocketHandler>
> 3/15 20:17:21 (pid:2870) DaemonCore: Command received via UDP from host <192.168.0.12:46066>
> 3/15 20:17:21 (pid:2870) DaemonCore: received command 421 (RESCHEDULE), calling handler (reschedule_negotiator)
> 3/15 20:17:21 (pid:2870) Calling HandleReq <reschedule_negotiator> (0)
> 3/15 20:17:21 (pid:2870) Called reschedule_negotiator()
> 3/15 20:17:21 (pid:2870) Return from HandleReq <reschedule_negotiator>
> 3/15 20:17:21 (pid:2870) Calling Handler <DaemonCore::HandleReqSocketHandler>
> 3/15 20:17:21 (pid:2870) Calling HandleReq <handle_q> (0)
> 3/15 20:17:21 (pid:2870) ZKM: setting default map to nvf__AT__nemo.phys.uwm.edu
> 3/15 20:17:21 (pid:2870) Return from HandleReq <handle_q>
> 3/15 20:17:21 (pid:2870) Return from Handler <DaemonCore::HandleReqSocketHandler>
> 3/15 20:17:21 (pid:2870) Calling Handler <DaemonCore::HandleReqSocketHandler>
> 3/15 20:17:22 (pid:2870) Calling HandleReq <attempt_access_handler> (0)
> 3/15 20:17:22 (pid:2870) Return from HandleReq <attempt_access_handler>
> 3/15 20:17:22 (pid:2870) Return from Handler <DaemonCore::HandleReqSocketHandler>
> 3/15 20:17:22 (pid:2870) Calling Timer handler 928980 (StartJobHandler)
> 3/15 20:17:22 (pid:2870) ERROR "IMPOSSIBLE: status for job 1658514.0 is COMPLETED but we're trying to start a shadow for it!" at line 6647 in file schedd.C
> *** End of file SchedLog
>
>
>
> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> Questions about this message or Condor in general?
> Email address of the local Condor administrator: condor-admin__AT__gravity.phys.uwm.edu
> The Official Condor Homepage is http://www.cs.wisc.edu/condor
>
> ----- End forwarded message -----

===================================
Nickolas Fotopoulos
nvf__AT__gravity.phys.uwm.edu

Office: (414) 229-6438
Fax: (414) 229-5589
University of Wisconsin - Milwaukee
Physics Bldg, Rm 471
===================================

----- End forwarded message -----

===========================================================================
Date of creation: Mon Mar 17 10:45:19 2008 (1205768722)
Subject: Actions

Assigned to bt by bt
===========================================================================
Date of actions: Mon Mar 17  9:21:07 2008 (1205774369)
Date: Mon, 17 Mar 2008 13:04:25 -0500
From: Bill Taylor <bt__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #17756] LIGO Support Ticket 17748

bt wrote:

> Hi,
> 
> Please append this to LIGO support Ticket 17748.
> 
> In short, some of the children in a rescue DAG had noop = True
> added, and at some point after the rescue DAG was submitted the
> schedd died.
> 
> Note, however, that this is with the following versions:
> 
> $ condor_version
> $CondorVersion: 6.9.4 Aug 30 2007 $
> $CondorPlatform: X86_64-LINUX_RHEL3 $
> $ condor_dagman -version
> $CondorVersion: 6.9.5 Nov 28 2007 BuildID: 65347 $
> $CondorPlatform: X86_64-LINUX_RHEL3 $
> 
> If you think this will be solved by upgrading to 7.0.1 and the
> latest DAGMan binaries Kent has produced, let us know and we
> will simply do that before debugging this further.
> 
> Thanks,
> 
> Scott
> 

Scott,

I have added these notes to the other ticket and am closing this
one out.

Bill
Condor Team

===========================================================================
Date mail was appended: Mon Mar 17 13:04:34 2008 (1205777075)
Subject: Actions

Ticket resolved by bt
===========================================================================
Date of actions: Mon Mar 17  9:21:07 2008 (1205777100)