LIGO Support Ticket 18024

Ticket Information
  Number:      admin 18024
  User:        skoranda@gravity.phys.uwm.edu
  Email:       anderson__AT__ligo.caltech.edu
  Status:      resolved
  Assigned To: tannenba
Date: Wed, 21 May 2008 12:28:40 -0500
From: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
To: condor-admin__AT__cs.wisc.edu
CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>
Subject: LIGO: condor_schedd crashed
X-Seen-BY: mailfromd 4.1 granite.cs.wisc.edu

Hi,

The schedd crashed and we received the following email. This
is Condor version 7.0.1 running on FC4 x86_64.

Scott

----- Forwarded message from Condor User <condor__AT__gravity.phys.uwm.edu> -----

Date: Wed, 21 May 2008 10:58:18 -0500
From: Condor User <condor__AT__gravity.phys.uwm.edu>
To: condor-admin__AT__gravity.phys.uwm.edu
Subject: [Condor] Problem marlin.phys.uwm.edu: condor_schedd exited (4)

This is an automated email from the Condor system
on machine "marlin.phys.uwm.edu".  Do not reply.

"/opt/condor/sbin/condor_schedd" on "marlin.phys.uwm.edu" exited with status 4.
Condor will automatically restart this process in 10 seconds.

*** Last 20 line(s) of file /opt/condor/home/log/SchedLog:
5/21 10:56:58 (pid:3734) DaemonCore: pid 16223 exited with status 25600, invoking reaper 3 <child_exit>
5/21 10:56:58 (pid:3734) Shadow pid 16223 for job 2344553.0 exited with status 100
5/21 10:56:58 (pid:3734) Checking consistency running and runnable jobs
5/21 10:56:58 (pid:3734) Tables are consistent
5/21 10:56:58 (pid:3734) Rebuilt prioritized runnable job list in 0.004s.
5/21 10:56:58 (pid:3734) DaemonCore: return from reaper for pid 16223
5/21 10:56:58 (pid:3734) Starting add_shadow_birthdate(2346129.0)
5/21 10:56:58 (pid:3734) Started shadow for job 2346129.0 on "<192.168.3.117:56931>", (shadow pid = 26467)
5/21 10:56:58 (pid:3734) DaemonCore: Command received via TCP from host <192.168.3.117:54380>, access level WRITE
5/21 10:56:58 (pid:3734) DaemonCore: received command 443 (VACATE_SERVICE), calling handler (vacate_service)
5/21 10:56:58 (pid:3734) Calling HandleReq <vacate_service> (0)
5/21 10:56:58 (pid:3734) Got VACATE_SERVICE from <192.168.3.117:54380>
5/21 10:56:58 (pid:3734) Match record (<192.168.3.117:56931>, 2346129, 0) deleted
5/21 10:56:58 (pid:3734) Return from HandleReq <vacate_service>
5/21 10:56:58 (pid:3734) DaemonCore: pid 26467 exited with status 27648, invoking reaper 3 <child_exit>
5/21 10:56:58 (pid:3734) Shadow pid 26467 for job 2346129.0 exited with status 108
5/21 10:56:58 (pid:3734) DaemonCore: return from reaper for pid 26467
5/21 10:57:01 (pid:3734) DaemonCore: pid 21561 exited with status 27392, invoking reaper 3 <child_exit>
5/21 10:57:01 (pid:3734) Shadow pid 21561 for job 2281454.0 exited with status 107
5/21 10:57:01 (pid:3734) ERROR "ERROR no job status for 2281454.0 in Scheduler::jobExitCode()!" at line 9224 in file schedd.C
*** End of file SchedLog



-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Questions about this message or Condor in general?
Email address of the local Condor administrator: condor-admin__AT__gravity.phys.uwm.edu
The Official Condor Homepage is http://www.cs.wisc.edu/condor

----- End forwarded message -----
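
A note on reading the SchedLog excerpt above: each "DaemonCore: pid N exited
with status S" line reports a raw wait status, while the following "Shadow pid
N ... exited with status E" line reports the exit code. On Linux the exit code
of a normally exiting process sits in the high byte of the wait status, so
25600, 27648 and 27392 correspond to 100, 108 and 107. A minimal sketch of
that decoding follows; the function name is hypothetical and is not part of
Condor.

```python
def shadow_exit_code(wait_status: int) -> int:
    """Decode the exit code from a raw Linux wait status.

    Hypothetical helper for reading SchedLog lines; not a Condor API.
    """
    # The low 7 bits hold the terminating signal; they are zero on a normal exit.
    if wait_status & 0x7F:
        raise ValueError(f"status {wait_status} is not a normal exit")
    # The exit code is stored in bits 8-15.
    return (wait_status >> 8) & 0xFF


# Raw statuses taken from the log excerpt in this ticket.
for raw in (25600, 27648, 27392):
    print(raw, "->", shadow_exit_code(raw))
# prints:
# 25600 -> 100
# 27648 -> 108
# 27392 -> 107
```

The decoded values match the "Shadow pid ... exited with status" lines in the
log, which confirms that the two lines describe the same exit event.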

===========================================================================
Date of creation: Wed May 21 12:25:51 2008 (1211390753)
Subject: Actions

Assigned to tannenba by gthain
===========================================================================
Date of actions: Thu May 22 11:37:55 2008 (1211474275)
Date: Mon, 09 Jun 2008 16:02:35 -0500
From: Todd Tannenbaum <tannenba__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #18024] LIGO: condor_schedd crashed

> Hi,
> 
> The schedd crashed and we received the following email. This
> is Condor version 7.0.1 running on FC4 x86_64.
> 
> Scott
> 
> 5/21 10:57:01 (pid:3734) ERROR "ERROR no job status for 2281454.0 in Scheduler::jobExitCode()!" at line 9224 in file schedd.C
> *** End of file SchedLog
>

Hi -

We believe we have patched the bug that caused the above error to 
happen. It was related to folks using "condor_rm -force".

The patch made it into Condor v7.0.2 which will appear on the web site 
this week.

Just FYI: if you want to reproduce the above error deterministically, 
suspend the shadow (via SIGSTOP) of a running job, remove the job with 
"condor_rm -force", and then resume the shadow.  That sequence will 
cause the schedd to exit with the above error.  In the wild, a race 
condition triggered by condor_rm -force caused the problem.
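
The reproduction sequence described above could be scripted roughly as
follows. This is only a sketch under stated assumptions: the function name and
its parameters are hypothetical, the shadow pid and job id must be read from
SchedLog by hand, the "-force" option is spelled as in this message (check the
condor_rm documentation for the exact option name in your version), and it
should only be run against a test pool with an unpatched schedd.

```python
import os
import signal
import subprocess


def reproduce_schedd_crash(shadow_pid, job_id, run=subprocess.run, kill=os.kill):
    """Suspend the shadow, force-remove its job, then resume the shadow.

    Hypothetical helper. `run` and `kill` are injectable so the sequence
    can be dry-run without touching a real pool.
    """
    # Option spelling taken from the ticket text; verify it for your version.
    remove_cmd = ["condor_rm", "-force", job_id]
    # Step 1: freeze the shadow of the running job.
    kill(shadow_pid, signal.SIGSTOP)
    try:
        # Step 2: remove the job while its shadow cannot react.
        run(remove_cmd, check=True)
    finally:
        # Step 3: resume the shadow; its later exit is what the schedd
        # reportedly fails to match to a job status.
        kill(shadow_pid, signal.SIGCONT)
    return remove_cmd


# Example call, using a shadow pid and job id of the kind seen in SchedLog:
# reproduce_schedd_crash(26467, "2346129.0")
```

The SIGCONT is sent in a "finally" clause so the shadow is not left stopped if
the removal command fails.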

Thanks for reporting this bug!!!


Best regards,
UW-Madison Condor Team (this time you got Todd)



===========================================================================
Date mail was appended: Mon Jun  9 16:03:38 2008 (1213045419)
Subject: Actions

Ticket resolved by tannenba
===========================================================================
Date of actions: Mon Jun  9 16:03:38 2008 (1213045420)