LIGO Support Ticket 18024
Ticket Information
Number: admin 18024
User: skoranda@gravity.phys.uwm.edu
Email: anderson__AT__ligo.caltech.edu
Status: resolved
Assigned To: tannenba
Date: Wed, 21 May 2008 12:28:40 -0500
From: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
To: condor-admin__AT__cs.wisc.edu
CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>
Subject: LIGO: condor_schedd crashed
X-Seen-BY: mailfromd 4.1 granite.cs.wisc.edu
Hi,
The schedd crashed and we received the following email. This
is Condor version 7.0.1 running on FC4 x86_64.
Scott
----- Forwarded message from Condor User <condor__AT__gravity.phys.uwm.edu> -----
Date: Wed, 21 May 2008 10:58:18 -0500
From: Condor User <condor__AT__gravity.phys.uwm.edu>
To: condor-admin__AT__gravity.phys.uwm.edu
Subject: [Condor] Problem marlin.phys.uwm.edu: condor_schedd exited (4)
This is an automated email from the Condor system
on machine "marlin.phys.uwm.edu". Do not reply.
"/opt/condor/sbin/condor_schedd" on "marlin.phys.uwm.edu" exited with status 4.
Condor will automatically restart this process in 10 seconds.
*** Last 20 line(s) of file /opt/condor/home/log/SchedLog:
5/21 10:56:58 (pid:3734) DaemonCore: pid 16223 exited with status 25600, invoking reaper 3 <child_exit>
5/21 10:56:58 (pid:3734) Shadow pid 16223 for job 2344553.0 exited with status 100
5/21 10:56:58 (pid:3734) Checking consistency running and runnable jobs
5/21 10:56:58 (pid:3734) Tables are consistent
5/21 10:56:58 (pid:3734) Rebuilt prioritized runnable job list in 0.004s.
5/21 10:56:58 (pid:3734) DaemonCore: return from reaper for pid 16223
5/21 10:56:58 (pid:3734) Starting add_shadow_birthdate(2346129.0)
5/21 10:56:58 (pid:3734) Started shadow for job 2346129.0 on "<192.168.3.117:56931>", (shadow pid = 26467)
5/21 10:56:58 (pid:3734) DaemonCore: Command received via TCP from host <192.168.3.117:54380>, access level WRITE
5/21 10:56:58 (pid:3734) DaemonCore: received command 443 (VACATE_SERVICE), calling handler (vacate_service)
5/21 10:56:58 (pid:3734) Calling HandleReq <vacate_service> (0)
5/21 10:56:58 (pid:3734) Got VACATE_SERVICE from <192.168.3.117:54380>
5/21 10:56:58 (pid:3734) Match record (<192.168.3.117:56931>, 2346129, 0) deleted
5/21 10:56:58 (pid:3734) Return from HandleReq <vacate_service>
5/21 10:56:58 (pid:3734) DaemonCore: pid 26467 exited with status 27648, invoking reaper 3 <child_exit>
5/21 10:56:58 (pid:3734) Shadow pid 26467 for job 2346129.0 exited with status 108
5/21 10:56:58 (pid:3734) DaemonCore: return from reaper for pid 26467
5/21 10:57:01 (pid:3734) DaemonCore: pid 21561 exited with status 27392, invoking reaper 3 <child_exit>
5/21 10:57:01 (pid:3734) Shadow pid 21561 for job 2281454.0 exited with status 107
5/21 10:57:01 (pid:3734) ERROR "ERROR no job status for 2281454.0 in Scheduler::jobExitCode()!" at line 9224 in file schedd.C
*** End of file SchedLog
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Questions about this message or Condor in general?
Email address of the local Condor administrator: condor-admin__AT__gravity.phys.uwm.edu
The Official Condor Homepage is http://www.cs.wisc.edu/condor
----- End forwarded message -----
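Note on the log excerpt above: each DaemonCore line reports a raw wait status (25600, 27648, 27392), and the following "Shadow pid ... exited with status" line reports the exit code (100, 108, 107). The two are consistent with the conventional Linux encoding, in which the exit code sits in the high byte of the wait status. A minimal sketch in Python, assuming that encoding; the helper name decode_wait_status is hypothetical and not part of Condor:

```python
def decode_wait_status(status):
    """Split a raw wait status into (exit_code, signal_number).

    Assumes the conventional Linux layout: exit code in bits 8-15,
    terminating signal number in the low 7 bits.
    """
    exit_code = (status >> 8) & 0xFF
    signal_number = status & 0x7F
    return exit_code, signal_number


if __name__ == "__main__":
    # Raw statuses copied from the SchedLog excerpt in this ticket.
    for raw in (25600, 27648, 27392):
        print(raw, decode_wait_status(raw))
```

Under that assumption the three raw values decode to exit codes 100, 108 and 107 with no terminating signal, matching what the schedd printed.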
===========================================================================
Date of creation: Wed May 21 12:25:51 2008 (1211390753)
Subject: Actions
Assigned to tannenba by gthain
===========================================================================
Date of actions: Thu May 22 11:37:55 2008 (1211474275)
Date: Mon, 09 Jun 2008 16:02:35 -0500
From: Todd Tannenbaum <tannenba__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #18024] LIGO: condor_schedd crashed
> Hi,
>
> The schedd crashed and we received the following email. This
> is Condor version 7.0.1 running on FC4 x86_64.
>
> Scott
>
> 5/21 10:57:01 (pid:3734) ERROR "ERROR no job status for 2281454.0 in Scheduler::jobExitCode()!" at line 9224 in file schedd.C
> *** End of file SchedLog
>
Hi -
We believe we have patched the bug that caused the above error to
happen. It was related to folks using "condor_rm -force".
The patch made it into Condor v7.0.2, which will appear on the web site
this week.
Just FYI: if you want to reproduce the above error deterministically,
suspend the shadow (via SIGSTOP) of a running job, remove the job with
"condor_rm -force", and then resume the shadow. That sequence will
cause the schedd to exit with the above error. In the wild, a race
condition triggered by condor_rm -force caused the problem.
Thanks for reporting this bug!!!
best regards,
UW-Madison Condor Team (this time you got Todd)
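The reproduction sequence described in the reply above (stop the shadow, remove the job, resume the shadow) could be scripted roughly as follows. This is only a sketch under stated assumptions: the function name reproduce and its parameters are hypothetical, the shadow pid and job id must be read from the SchedLog by hand, and the "-force" argument is copied from the message as written, so the exact option spelling should be checked against the installed condor_rm. On an affected version this is expected to make the schedd exit, so it belongs on a test pool only.

```python
import os
import signal
import subprocess


def reproduce(shadow_pid, job_id, send_signal=os.kill, run=subprocess.run):
    """Suspend a job's shadow, remove the job, then resume the shadow.

    send_signal and run are injectable so the sequence can be exercised
    without touching a real pool (hypothetical design, not Condor code).
    """
    # Step 1: freeze the shadow so it cannot report its exit yet.
    send_signal(shadow_pid, signal.SIGSTOP)
    try:
        # Step 2: remove the job while the shadow is stopped.
        # "-force" is taken verbatim from the ticket text.
        run(["condor_rm", "-force", job_id], check=True)
    finally:
        # Step 3: always resume the shadow, even if the removal failed.
        send_signal(shadow_pid, signal.SIGCONT)
```

For example, reproduce(26467, "2346129.0") would target the shadow pid and job id that appear together in the log excerpt, assuming that shadow were still running.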
===========================================================================
Date mail was appended: Mon Jun 9 16:03:38 2008 (1213045419)
Subject: Actions
Ticket resolved by tannenba
===========================================================================
Date of actions: Mon Jun 9 16:03:38 2008 (1213045420)