LIGO Support Ticket 1943

Ticket Information
  Number:      support 1943
  User:        espinoza@ligo.caltech.edu
  Email:       anderson__AT__ligo.caltech.edu,espinoza_e__AT__ligo.caltech.edu
  Status:      resolved
  Assigned To: tannenba
Date: Wed, 28 Mar 2007 10:54:00 -0700
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: roy__AT__cs.wisc.edu, Stuart Anderson <anderson__AT__ligo.caltech.edu>
Subject: LIGO: Schedd Crash

Greetings,

We had a schedd crash on our main submit machine. Here is the obit e-mail.

Here is a link to the following files:
SchedLog: SchedLog for 3/28
QuillLog: QuillLog for 3/28
condor_schedd.txt: condor_schedd obituary
condor_quill.txt: condor_quill obituary

http://www.ligo.caltech.edu/~eespinoz/debug/03-28-2007

There was no core file.

Here is a little more context:
[root@ldas-grid sbin]# strings condor_schedd | grep \$Condor
$CondorVersion: 6.8.4 Feb  1 2007 $
$CondorPlatform: X86_64-LINUX_RHEL3 $
$CondorVersion:
$CondorPlatform:	
[root@ldas-grid sbin]# strings condor_quill | grep \$Condor
$CondorVersion: 6.8.4 Feb  1 2007 $
$CondorPlatform: X86_64-LINUX_RHEL3 $
[root@ldas-grid sbin]# cat /etc/fedora-release
Fedora Core release 4 (Stentz)
[root@ldas-grid sbin]# cd ../bin
[root@ldas-grid bin]# ./condor_version
$CondorVersion: 6.8.4 Feb  1 2007 $
$CondorPlatform: X86_64-LINUX_RHEL3 $


Alain, please add this to the ligo-tickets page.

Thanks,
Erik

-------- Original Message --------
Subject: [Condor] Problem
Date: Wed, 28 Mar 2007 08:41:04 -0700
From: condor__AT__ldas-grid.ligo.caltech.edu
To: ldas_admin_cit__AT__ligo.caltech.edu

This is an automated email from the Condor system
on machine "ldas-grid.ligo.caltech.edu".  Do not reply.

"/ldcg/condor/sbin/condor_schedd" on "ldas-grid.ligo.caltech.edu" exited 
with status 44.
Condor will automatically restart this process in 10 seconds.

*** Last 20 line(s) of file SchedLog:
3/28 08:39:05 (pid:10514) Activity on stashed negotiator socket
3/28 08:39:05 (pid:10514) Negotiating for owner: cokelaer@ligo
3/28 08:39:05 (pid:10514) Checking consistency running and runnable jobs
3/28 08:39:05 (pid:10514) Tables are consistent
3/28 08:39:05 (pid:10514) Out of servers - 0 jobs matched, 5 jobs idle, 1 jobs rejected
3/28 08:39:05 (pid:10514) Activity on stashed negotiator socket
3/28 08:39:05 (pid:10514) Negotiating for owner: cokelaer@ligo
3/28 08:39:05 (pid:10514) Checking consistency running and runnable jobs
3/28 08:39:05 (pid:10514) Tables are consistent
3/28 08:39:05 (pid:10514) Out of jobs - 5 jobs matched, 0 jobs idle, flock level = 0
3/28 08:39:06 (pid:10514) Shadow pid 3510 for job 11365335.0 exited with status 100
3/28 08:39:08 (pid:10514) match (<10.14.1.7:56834>#1173983829#833) out of jobs (cluster id 11366033); relinquishing
3/28 08:39:08 (pid:10514) Sent RELEASE_CLAIM to startd on <10.14.1.7:56834>
3/28 08:39:08 (pid:10514) Match record (<10.14.1.7:56834>, 11366033, 0) deleted
3/28 08:39:08 (pid:10514) statfs() failed: 13/Permission denied
3/28 08:39:08 (pid:10514) Starting add_shadow_birthdate(11366034.0)
3/28 08:39:08 (pid:10514) Started shadow for job 11366034.0 on "<10.14.1.55:40092>", (shadow pid = 6350)
3/28 08:39:08 (pid:10514) Starting add_shadow_birthdate(11366032.0)
3/28 08:39:08 (pid:10514) Started shadow for job 11366032.0 on "<10.14.1.155:50283>", (shadow pid = 6351)
3/28 08:39:08 (pid:10514) ReliSock::get_file˙
*** End of file SchedLog



-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Questions about this message or Condor in general?
Email address of the local Condor administrator: 
ldas_admin_cit__AT__ligo.caltech.edu
The Official Condor Homepage is http://www.cs.wisc.edu/condor

-- 
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517

===========================================================================
Date of creation: Wed Mar 28 12:56:15 2007 (1175104578)
Subject: Actions

Assigned to tannenba by bgietzel
===========================================================================
Date of actions: Wed Mar 28 15:02:14 2007 (1175112134)
Date: Wed, 28 Mar 2007 15:31:47 -0500
To: condor-support__AT__cs.wisc.edu
From: Todd Tannenbaum <tannenba__AT__cs.wisc.edu>
Subject: Re: [condor-support #1943] LIGO: Schedd Crash



>Greetings,
>
>We had a schedd crash on our main submit machine. Here is the obit e-mail.


The obit email says:
>>"/ldcg/condor/sbin/condor_schedd" on "ldas-grid.ligo.caltech.edu" exited
>>with status 44.


When a Condor daemon exits with status 44, it means that it failed to 
write to its log (in this case, there was an error writing to the SchedLog).

So there would not be a core file, since the schedd didn't "crash" per se.

However, in your LOG directory is there a "dprintf_failure.SCHEDD" 
file?  When a Condor daemon fails to write to its log, it tries to 
write an error message into a dprintf_failure* file and then exits 
with status 44.
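
A minimal Python sketch of the check described above: it lists any dprintf_failure* files in a daemon log directory together with their sizes, since an empty file carries no diagnostic text. The function names and the directory path in the usage note are illustrative assumptions, not part of Condor.

```python
from pathlib import Path


def find_dprintf_failures(log_dir):
    """Return (file name, size in bytes) for each dprintf_failure* file in log_dir."""
    results = []
    for path in sorted(Path(log_dir).glob("dprintf_failure*")):
        if path.is_file():
            results.append((path.name, path.stat().st_size))
    return results


def report(log_dir):
    """Print one line per failure file, flagging the empty ones."""
    for name, size in find_dprintf_failures(log_dir):
        note = "empty, no error text was recorded" if size == 0 else f"{size} bytes"
        print(f"{name}: {note}")
```

For example, report("/path/to/condor/log") would be called with whatever directory `condor_config_val LOG` prints on the submit machine; the path shown here is a placeholder.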

Does the above help?

regards,
Todd


-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Todd Tannenbaum                       University of Wisconsin-Madison
Condor Project Research               Department of Computer Sciences
tannenba__AT__cs.wisc.edu                  1210 W. Dayton St. Rm #4257
http://www.cs.wisc.edu/~tannenba      Madison, WI 53706-1685
Phone: (608) 263-7132  FAX: (608) 262-9777


===========================================================================
Date mail was appended: Wed Mar 28 15:39:05 2007 (1175114345)
Subject: Actions

Status changed from open to pending by tannenba
===========================================================================
Date of actions: Wed Mar 28 15:39:05 2007 (1175114346)
Date: Wed, 28 Mar 2007 14:14:22 -0700
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: anderson__AT__ligo.caltech.edu
Subject: Re: [condor-support #1943] LIGO: Schedd Crash

Hi Todd,

> When a Condor daemon exits with status 44, it means that it failed to 
> write to its log (in this case, there was an error writing to the SchedLog).
> 
> So there would not be a core file, since the schedd didn't "crash" per se.

Makes sense.

> However, in your LOG directory is there a "dprintf_failure.SCHEDD" 
> file?  When a Condor daemon fails to write to its log, it tries to 
> write an error message into a dprintf_failure* file and then exits 
> with status 44.

We did get dprintf_failure files, but they were empty.

This leads me to believe that there was a disk space issue; however, the 
schedd/quill daemons were restarted automatically by condor_master and 
continued along happily.

This is really peculiar because Condor is on a separate partition, 
completely segregated from the rest of the system. Only Condor has the 
ability to write there, so although it's possible that the partition 
filled up, it would have had to be cleaned up within the ~30 seconds 
before condor_master respawned the daemons.

By the time I checked the disk, it had ~15 gigs available.
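
A free-space check like the one above can be scripted so that it records a timestamped sample instead of relying on a manual look after the fact. This is a minimal sketch using Python's standard shutil.disk_usage; the helper name is illustrative and the path is a placeholder for the Condor partition's mount point.

```python
import shutil
import time


def free_space_sample(path):
    """Return (timestamp string, free GiB) for the filesystem holding path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    return stamp, usage.free / 2**30


if __name__ == "__main__":
    # "/" is a placeholder; substitute the mount point of the Condor partition.
    stamp, free_gib = free_space_sample("/")
    print(f"{stamp} free: {free_gib:.1f} GiB")
```

Run at a short interval, a sampler of this kind might catch a partition that fills and empties within a window of a few tens of seconds, which a single later check would miss.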

Thanks,
-- 
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517

===========================================================================
Date mail was appended: Wed Mar 28 16:15:01 2007 (1175116504)
Date: Fri, 6 Apr 2007 12:47:57 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>
Subject: Re: [condor-support #1943] LIGO: Schedd Crash

I believe this is sufficiently understood to close this ticket.

I have opened a new condor-admin ticket (#15277) to pursue whether
there are any enhancements to DAGMan that may make this less likely
to happen again in the future.

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Fri Apr  6 14:48:19 2007 (1175888899)
Date: Mon, 09 Apr 2007 11:27:08 -0500
To: condor-support__AT__cs.wisc.edu
From: Todd Tannenbaum <tannenba__AT__cs.wisc.edu>
Subject: Re: [condor-support #1943] LIGO: Schedd Crash

At 02:48 PM 4/6/2007, condor-support response tracking system wrote:
>I believe this is sufficiently understood to close this ticket.
>
>I have opened a new condor-admin ticket (#15277) to pursue whether
>there are any enhancements to DAGMan that may make this less likely
>to happen again in the future.
>
>Thanks.

Sounds good.

Thanks Erik,
regards,
Todd


===========================================================================
Date mail was appended: Mon Apr  9 11:33:54 2007 (1176136435)
Subject: Actions

Ticket resolved by tannenba
===========================================================================
Date of actions: Mon Apr  9 11:33:54 2007 (1176136437)