LIGO Support Ticket 1943
Ticket Information
Number: support 1943
User: espinoza@ligo.caltech.edu
Email: anderson__AT__ligo.caltech.edu,espinoza_e__AT__ligo.caltech.edu
Status: resolved
Assigned To: tannenba
Date: Wed, 28 Mar 2007 10:54:00 -0700
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: roy__AT__cs.wisc.edu, Stuart Anderson <anderson__AT__ligo.caltech.edu>
Subject: LIGO: Schedd Crash
Greetings,
We had a schedd crash on our main submit machine. Here is the obit e-mail.
Here is a link to the following files:
SchedLog: SchedLog for 3/28
QuillLog: QuillLog for 3/28
condor_schedd.txt: condor_schedd obituary
condor_quill.txt: condor_quill obituary
http://www.ligo.caltech.edu/~eespinoz/debug/03-28-2007
There was no core file.
Here is a little more context:
[root@ldas-grid sbin]# strings condor_schedd | grep \$Condor
$CondorVersion: 6.8.4 Feb 1 2007 $
$CondorPlatform: X86_64-LINUX_RHEL3 $
$CondorVersion:
$CondorPlatform:
[root@ldas-grid sbin]# strings condor_quill | grep \$Condor
$CondorVersion: 6.8.4 Feb 1 2007 $
$CondorPlatform: X86_64-LINUX_RHEL3 $
[root@ldas-grid sbin]# cat /etc/fedora-release
Fedora Core release 4 (Stentz)
[root@ldas-grid sbin]# cd ../bin
[root@ldas-grid bin]# ./condor_version
$CondorVersion: 6.8.4 Feb 1 2007 $
$CondorPlatform: X86_64-LINUX_RHEL3 $
Alain, please add this to the ligo-tickets page.
Thanks,
Erik
-------- Original Message --------
Subject: [Condor] Problem
Date: Wed, 28 Mar 2007 08:41:04 -0700
From: condor__AT__ldas-grid.ligo.caltech.edu
To: ldas_admin_cit__AT__ligo.caltech.edu
This is an automated email from the Condor system
on machine "ldas-grid.ligo.caltech.edu". Do not reply.
"/ldcg/condor/sbin/condor_schedd" on "ldas-grid.ligo.caltech.edu" exited
with status 44.
Condor will automatically restart this process in 10 seconds.
*** Last 20 line(s) of file SchedLog:
3/28 08:39:05 (pid:10514) Activity on stashed negotiator socket
3/28 08:39:05 (pid:10514) Negotiating for owner: cokelaer@ligo
3/28 08:39:05 (pid:10514) Checking consistency running and runnable jobs
3/28 08:39:05 (pid:10514) Tables are consistent
3/28 08:39:05 (pid:10514) Out of servers - 0 jobs matched, 5 jobs idle,
1 jobs rejected
3/28 08:39:05 (pid:10514) Activity on stashed negotiator socket
3/28 08:39:05 (pid:10514) Negotiating for owner: cokelaer@ligo
3/28 08:39:05 (pid:10514) Checking consistency running and runnable jobs
3/28 08:39:05 (pid:10514) Tables are consistent
3/28 08:39:05 (pid:10514) Out of jobs - 5 jobs matched, 0 jobs idle,
flock level = 0
3/28 08:39:06 (pid:10514) Shadow pid 3510 for job 11365335.0 exited with
status 100
3/28 08:39:08 (pid:10514) match (<10.14.1.7:56834>#1173983829#833) out
of jobs (cluster id 11366033); relinquishing
3/28 08:39:08 (pid:10514) Sent RELEASE_CLAIM to startd on <10.14.1.7:56834>
3/28 08:39:08 (pid:10514) Match record (<10.14.1.7:56834>, 11366033, 0)
deleted
3/28 08:39:08 (pid:10514) statfs() failed: 13/Permission denied
3/28 08:39:08 (pid:10514) Starting add_shadow_birthdate(11366034.0)
3/28 08:39:08 (pid:10514) Started shadow for job 11366034.0 on
"<10.14.1.55:40092>", (shadow pid = 6350)
3/28 08:39:08 (pid:10514) Starting add_shadow_birthdate(11366032.0)
3/28 08:39:08 (pid:10514) Started shadow for job 11366032.0 on
"<10.14.1.155:50283>", (shadow pid = 6351)
3/28 08:39:08 (pid:10514) ReliSock::get_file
*** End of file SchedLog
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Questions about this message or Condor in general?
Email address of the local Condor administrator:
ldas_admin_cit__AT__ligo.caltech.edu
The Official Condor Homepage is http://www.cs.wisc.edu/condor
--
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517
===========================================================================
Date of creation: Wed Mar 28 12:56:15 2007 (1175104578)
Subject: Actions
Assigned to tannenba by bgietzel
===========================================================================
Date of actions: Wed Mar 28 15:02:14 2007 (1175112134)
Date: Wed, 28 Mar 2007 15:31:47 -0500
To: condor-support__AT__cs.wisc.edu
From: Todd Tannenbaum <tannenba__AT__cs.wisc.edu>
Subject: Re: [condor-support #1943] LIGO: Schedd Crash
>Greetings,
>
>We had a schedd crash on our main submit machine. Here is the obit e-mail.
From the obit email, it says:
>>"/ldcg/condor/sbin/condor_schedd" on "ldas-grid.ligo.caltech.edu" exited
>>with status 44.
When a Condor daemon exits with status 44, it means that it failed to
write to its log (in this case, there was an error writing to the SchedLog).
So there would not be a core file, since the schedd didn't "crash" per se.
However, in your LOG directory is there a "dprintf_failure.SCHEDD"
file? When a Condor daemon fails to write to its log, it tries to
write an error message into a dprintf_failure* file and then exits
with status 44.
Does the above help?
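For reference, the check for dprintf_failure files described above can be sketched in Python. This is a minimal illustration and not part of Condor: the function name find_dprintf_failures and the example path are hypothetical, and the real LOG directory should be taken from the local configuration (for example, the output of "condor_config_val LOG"). The sketch lists any dprintf_failure.* files with their sizes, so that an empty file (no error message recorded) is easy to spot.

```python
from pathlib import Path


def find_dprintf_failures(log_dir):
    """Return (file name, size in bytes) for each dprintf_failure.* file.

    log_dir is assumed to be the Condor LOG directory; the caller must
    supply it, since this sketch does not read the Condor configuration.
    """
    results = []
    for path in sorted(Path(log_dir).glob("dprintf_failure.*")):
        if path.is_file():
            results.append((path.name, path.stat().st_size))
    return results


if __name__ == "__main__":
    # Hypothetical path; substitute the directory reported by the
    # local configuration.
    for name, size in find_dprintf_failures("/path/to/condor/log"):
        note = "empty, no error message recorded" if size == 0 else "has content"
        print(f"{name}: {size} bytes ({note})")
```

A zero-byte file only shows that the daemon could not record its error message; it does not by itself identify the cause of the write failure.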
regards,
Todd
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Todd Tannenbaum University of Wisconsin-Madison
Condor Project Research Department of Computer Sciences
tannenba__AT__cs.wisc.edu 1210 W. Dayton St. Rm #4257
http://www.cs.wisc.edu/~tannenba Madison, WI 53706-1685
Phone: (608) 263-7132 FAX: (608) 262-9777
===========================================================================
Date mail was appended: Wed Mar 28 15:39:05 2007 (1175114345)
Subject: Actions
Status changed from open to pending by tannenba
===========================================================================
Date of actions: Wed Mar 28 15:39:05 2007 (1175114346)
Date: Wed, 28 Mar 2007 14:14:22 -0700
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: anderson__AT__ligo.caltech.edu
Subject: Re: [condor-support #1943] LIGO: Schedd Crash
Hi Todd,
> When a Condor daemon exits with status 44, it means that it failed to
> write to its log (in this case, there was an error writing to the SchedLog).
>
> So there would not be a core file, since the schedd didn't "crash" per se.
Makes sense.
> However, in your LOG directory is there a "dprintf_failure.SCHEDD"
> file? When a Condor daemon fails to write to its log, it tries to
> write an error message into a dprintf_failure* file and then exits
> with status 44.
We did get a dprintf_failure, but the files were empty.
This leads me to believe that there was a disk space issue; however, the
schedd/quill daemons were restarted automatically by condor_master and
continued along happily.
This is really peculiar because condor is on a separate partition,
completely segregated from the rest of the system. Only condor has the
ability to write, so although it's possible that the partition filled up,
it was cleaned up in the ~30 seconds before the respawn by condor_master.
By the time I checked the disk, it had ~15 gigs available.
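A free-space check on the log partition can be sketched as follows, assuming a POSIX system. The helper names free_bytes and has_room, the example path, and the threshold are all hypothetical. Note that a single check like this only shows the state at the moment it runs, so it can neither confirm nor rule out a partition that filled up briefly and was cleaned up before anyone looked.

```python
import os


def free_bytes(path):
    """Return the bytes available to unprivileged users on the
    filesystem that contains path (POSIX only, via statvfs)."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize


def has_room(path, min_free_bytes):
    """True if at least min_free_bytes are available under path."""
    return free_bytes(path) >= min_free_bytes


if __name__ == "__main__":
    # Hypothetical path and threshold, for illustration only.
    log_dir = "/path/to/condor/log"
    one_gib = 1024 ** 3
    print(f"free: {free_bytes(log_dir) / one_gib:.1f} GiB")
    print("enough room" if has_room(log_dir, one_gib) else "low on space")
```

Catching a transient shortage would require sampling this value periodically and keeping the history, rather than checking once after the fact.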
Thanks,
--
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517
===========================================================================
Date mail was appended: Wed Mar 28 16:15:01 2007 (1175116504)
Date: Fri, 6 Apr 2007 12:47:57 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>
Subject: Re: [condor-support #1943] LIGO: Schedd Crash
I believe this is sufficiently understood to close this ticket.
I have opened a new condor-admin ticket (#15277) to pursue whether
there are any enhancements to DAGMan that may make this less likely
to happen again in the future.
Thanks.
On Wed, Mar 28, 2007 at 02:14:22PM -0700, Erik A. Espinoza wrote:
> Hi Todd,
>
> >When a Condor daemon exits with status 44, it means that it failed to
> >write to its log (in this case, there was an error writing to the
> >SchedLog).
> >
> >So there would not be a core file, since the schedd didn't "crash" per se.
>
> Makes sense.
>
> >However, in your LOG directory is there a "dprintf_failure.SCHEDD"
> >file? When a Condor daemon fails to write to its log, it tries to
> >write an error message into a dprintf_failure* file and then exits
> >with status 44.
>
> We did get a dprintf_failure, but the files were empty.
>
> This leads me to believe that there was a disk space issue; however, the
> schedd/quill daemons were restarted automatically by condor_master and
> continued along happily.
>
> This is really peculiar because condor is on a separate partition,
> completely segregated from the rest of the system. Only condor has the
> ability to write, so although it's possible that the partition filled up,
> it was cleaned up in the ~30 seconds before the respawn by condor_master.
>
> By the time I checked the disk, it had ~15 gigs available.
>
> Thanks,
> --
> Erik A. Espinoza
> Systems Administrator
> LIGO/Caltech - MS 18-34
> Pasadena, CA 91125
> Ph: 626-395-8517
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Fri Apr 6 14:48:19 2007 (1175888899)
Date: Mon, 09 Apr 2007 11:27:08 -0500
To: condor-support__AT__cs.wisc.edu
From: Todd Tannenbaum <tannenba__AT__cs.wisc.edu>
Subject: Re: [condor-support #1943] LIGO: Schedd Crash
At 02:48 PM 4/6/2007, condor-support response tracking system wrote:
>I believe this is sufficiently understood to close this ticket.
>
>I have opened a new condor-admin ticket (#15277) to pursue whether
>there are any enhancements to DAGMan that may make this less likely
>to happen again in the future.
>
>Thanks.
Sounds good.
Thanks Erik,
regards,
Todd
===========================================================================
Date mail was appended: Mon Apr 9 11:33:54 2007 (1176136435)
Subject: Actions
Ticket resolved by tannenba
===========================================================================
Date of actions: Mon Apr 9 11:33:54 2007 (1176136437)