LIGO Support Ticket 19097

Ticket Information
  Number:      admin 19097
  User:        henning.fehrmann@aei.mpg.de
  Email:       carsten.aulbert__AT__aei.mpg.de
  Status:      resolved
  Assigned To: tannenba
Date: Thu, 12 Mar 2009 09:51:30 +0100
From: Henning Fehrmann <henning.fehrmann__AT__aei.mpg.de>
To: condor-admin__AT__cs.wisc.edu
CC: Carsten Aulbert <carsten.aulbert__AT__aei.mpg.de>
Subject: problems with schedd

Hello,

We installed Condor 7.2.0.

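From time to time, when a large number of DAG jobs of a particular kind are
submitted, the schedd hangs. In this state, no logs are written and no jobs are
started anymore. The schedd is in the "S" (sleeping) state.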

We had to kill the schedd and restart it.

We set SCHEDD_DEBUG=D_FULLDEBUG but did not find anything useful in the log.

Here are links to the log file and the core dump.
The last entry in the log file before the hang is at 07:05:46.

http://atlas.atlas.aei.uni-hannover.de/~fehrmann/SchedLog.106.gz
http://atlas.atlas.aei.uni-hannover.de/~fehrmann/schedd.crash.20090312.gz

Please tell us if you need more information.

Thank you,
Henning


===========================================================================
Date of creation: Thu Mar 12  3:51:52 2009 (1236847914)
Subject: Actions

Assigned to roy by roy
===========================================================================
Date of actions: Thu Mar 12 10:45:51 2009 (1236872751)
From: Alain Roy <roy__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19097] problems with schedd
Date: Thu, 12 Mar 2009 10:49:21 -0500

Hi Henning,

This sounds serious. Normally, I would simply assign this to Pete  
Keller--he works with LIGO a lot, knows a lot about DAGMan, and knows
a lot about Condor's internals. However, he's working on the Condor  
port to Debian 5.0 at "the highest priority", so I don't want to  
distract him from that task. I'm looking around for the best person to  
address this, and I'll let you know soon.

Just so you know, Condor 7.2.1 has been released, but I am not aware  
of any bug fixes in that release that sound like they address this  
problem.

Thanks,
-alain
-----------------------------------------------------------------
Alain Roy                         vdt-support__AT__opensciencegrid.org
VDT Support                   http://vdt.cs.wisc.edu/support.html

> we installed condor 7.2.0.
>
> From time to time, when a high number of dag jobs of a particular  
> kind are submitted the schedd hangs.
> In this state, no logs are written and no jobs are started anymore.
> The schedd is in
> the " S " state.
>
> We had to kill the schedd and to restart it.
>
> We set  SCHEDD_DEBUG=D_FULLDEBUG but did not find any interesting  
> information.
>
> Here are links pointing to the log file and the coredump.
> The last entry before dying in the log file is at 07:05:46.
>
> http://atlas.atlas.aei.uni-hannover.de/~fehrmann/SchedLog.106.gz
> http://atlas.atlas.aei.uni-hannover.de/~fehrmann/schedd.crash.20090312.gz
>
> Please tell us if you need more information.


===========================================================================
Date mail was appended: Thu Mar 12 10:49:27 2009 (1236872967)
From: Alain Roy <roy__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19097] problems with schedd
Date: Thu, 12 Mar 2009 10:54:38 -0500

Oh, one question: what precisely did you do to generate the core file?

Thanks,
-alain


===========================================================================
Date mail was appended: Thu Mar 12 10:54:44 2009 (1236873284)
Date: Thu, 12 Mar 2009 16:57:57 +0100
From: Carsten Aulbert <carsten.aulbert__AT__aei.mpg.de>
To: condor-admin__AT__cs.wisc.edu
CC: henning.fehrmann__AT__aei.mpg.de
Subject: Re: [condor-admin #19097] problems with schedd

Hi Alain,

condor-admin response tracking system wrote:
> Oh, one question: what precisely did you do to generate the core file?

kill -6 <PID> on condor_schedd

That should send SIGABRT; SIGHUP did not do anything.

Thanks for looking into that.

Cheers

Carsten
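
As a rough sketch of the step described above (checking the daemon's state
and then sending signal 6), assuming a Linux host with /proc mounted. The
helper names parse_state, read_state and request_core are hypothetical and
are not part of Condor. Whether a core file is actually written also depends
on the core size limit of the target process.

```python
import os
import signal


def parse_state(stat_line):
    """Return the one-letter process state from a /proc/<pid>/stat line.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the last ')'.
    """
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def read_state(pid):
    """Read the current state of a live process (Linux only)."""
    with open(f"/proc/{pid}/stat") as handle:
        return parse_state(handle.read())


def request_core(pid):
    """Send SIGABRT (signal 6 on Linux), the equivalent of `kill -6 <PID>`."""
    os.kill(pid, signal.SIGABRT)
```

With the PID of the hung condor_schedd, read_state(pid) would be expected to
report "S" in the situation described in this ticket, and request_core(pid)
then asks the process to abort.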

===========================================================================
Date mail was appended: Thu Mar 12 10:58:11 2009 (1236873491)
Subject: Actions

Assigned to tannenba by roy
===========================================================================
Date of actions: Thu Mar 12 12:33:25 2009 (1236879206)
From: Alain Roy <roy__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19097] problems with schedd
Date: Thu, 12 Mar 2009 12:36:43 -0500

Hi,

I've talked it over with people, and I am giving this ticket to Todd  
Tannenbaum to handle. I've also added it to the list of LIGO tickets  
so it is tracked properly.

http://www.cs.wisc.edu/condor/ligo-tickets/

Thanks,
-alain


===========================================================================
Date mail was appended: Thu Mar 12 12:36:49 2009 (1236879409)
Date: Thu, 12 Mar 2009 20:55:44 +0100
From: Henning Fehrmann <henning.fehrmann__AT__aei.mpg.de>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: carsten.aulbert__AT__aei.mpg.de
Subject: Re: [condor-admin #19097] problems with schedd


Hi Alain,
> Hi,
> 
> I've talked it over with people, and I am giving this ticket to Todd  
> Tannenbaum to handle. I've also added it to the list of LIGO tickets  
> so it is tracked properly.
> 
> http://www.cs.wisc.edu/condor/ligo-tickets/

Thank you. Please let us know if you need further information.

Cheers,
Henning

===========================================================================
Date mail was appended: Thu Mar 12 14:56:06 2009 (1236887767)
Subject: Comments added

Carsten thinks we should revisit this once the site has upgraded to the latest
stable Condor version and to Debian Lenny.

Comments added by tannenba

===========================================================================
Date comments were added: Fri Jun  5 13:29:55 2009 (1244226595)
Subject: Actions

Ticket resolved by tannenba
===========================================================================
Date of actions: Fri Jun 19 13:28:13 2009 (1245436093)