LIGO Support Ticket 19097
Ticket Information
Number: admin 19097
User: henning.fehrmann@aei.mpg.de
Email: carsten.aulbert__AT__aei.mpg.de
Status: resolved
Assigned To: tannenba
Date: Thu, 12 Mar 2009 09:51:30 +0100
From: Henning Fehrmann <henning.fehrmann__AT__aei.mpg.de>
To: condor-admin__AT__cs.wisc.edu
CC: Carsten Aulbert <carsten.aulbert__AT__aei.mpg.de>
Subject: problems with schedd
Hello,
we installed Condor 7.2.0.
From time to time, when a high number of DAG jobs of a particular
kind are submitted, the schedd hangs.
In this state, no logs are written and no jobs are started anymore. The schedd is in
the "S" state.
We had to kill the schedd and restart it.
We set SCHEDD_DEBUG=D_FULLDEBUG but did not find any interesting information.
Here are links pointing to the log file and the coredump.
The last log entry before the schedd died is at 07:05:46.
http://atlas.atlas.aei.uni-hannover.de/~fehrmann/SchedLog.106.gz
http://atlas.atlas.aei.uni-hannover.de/~fehrmann/schedd.crash.20090312.gz
Please tell us if you need more information.
Thank you,
Henning
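
For reference, the "S" state mentioned above is the one-letter process state that Linux reports for a process in interruptible sleep, meaning it is waiting for an event rather than using CPU. A minimal sketch of reading that state from /proc is given here. It assumes a Linux host, and the helper names are hypothetical; they are not part of Condor.

```python
def parse_proc_state(stat_line: str) -> str:
    """Return the one-letter state from a /proc/<pid>/stat line.

    The command name (field 2) is wrapped in parentheses and may itself
    contain spaces or parentheses, so split after the LAST ')'.
    """
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def process_state(pid: int) -> str:
    """Read the current state of a process (Linux only, assumes /proc)."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_proc_state(f.read())
```

Calling process_state() with the PID of condor_schedd would return "S" for a daemon in the state described in this message. The state letter alone does not say what the process is waiting for.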
===========================================================================
Date of creation: Thu Mar 12 3:51:52 2009 (1236847914)
Subject: Actions
Assigned to roy by roy
===========================================================================
Date of actions: Thu Mar 12 10:45:51 2009 (1236872751)
From: Alain Roy <roy__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19097] problems with schedd
Date: Thu, 12 Mar 2009 10:49:21 -0500
Hi Henning,
This sounds serious. Normally, I would simply assign this to Pete
Keller--he works with LIGO a lot, knows a lot about DAGman, and knows
a lot about Condor's internals. However, he's working on the Condor
port to Debian 5.0 at "the highest priority", so I don't want to
distract him from that task. I'm looking around for the best person to
address this, and I'll let you know soon.
Just so you know, Condor 7.2.1 has been released, but I am not aware
of any bug fixes in that release that sound like they address this
problem.
Thanks,
-alain
-----------------------------------------------------------------
Alain Roy vdt-support__AT__opensciencegrid.org
VDT Support http://vdt.cs.wisc.edu/support.html
> we installed condor 7.2.0.
>
> From time to time, when a high number of dag jobs of a particular
> kind are submitted the schedd hangs.
> In this state, no logs are written and not jobs are started anymore.
> The schedd is in
> the " S " state.
>
> We had to kill the schedd and to restart it.
>
> We set SCHEDD_DEBUG=D_FULLDEBUG but did not find any interesting
> information.
>
> Here are links pointing to the log file and the coredump.
> The last entry before dying in the log file is at 07:05:46.
>
> http://atlas.atlas.aei.uni-hannover.de/~fehrmann/SchedLog.106.gz
> http://atlas.atlas.aei.uni-hannover.de/~fehrmann/schedd.crash.20090312.gz
>
> Please tell us if you need more information.
===========================================================================
Date mail was appended: Thu Mar 12 10:49:27 2009 (1236872967)
From: Alain Roy <roy__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19097] problems with schedd
Date: Thu, 12 Mar 2009 10:54:38 -0500
Oh, one question: what precisely did you do to generate the core file?
Thanks,
-alain
===========================================================================
Date mail was appended: Thu Mar 12 10:54:44 2009 (1236873284)
Date: Thu, 12 Mar 2009 16:57:57 +0100
From: Carsten Aulbert <carsten.aulbert__AT__aei.mpg.de>
To: condor-admin__AT__cs.wisc.edu
CC: henning.fehrmann__AT__aei.mpg.de
Subject: Re: [condor-admin #19097] problems with schedd
Hi Alain,
condor-admin response tracking system wrote:
> Oh, one question: what precisely did you do to generate the core file?
kill -6 <PID> on condor_schedd
That should be SIGABRT; SIGHUP did not do anything.
Thanks for looking into that.
Cheers
Carsten
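
The command above sends signal 6, which is SIGABRT on Linux. A minimal Python sketch of the same step follows. The function name is hypothetical, and whether a core file is actually written depends on the core size limit (ulimit -c) and on how the target process handles the signal, which this sketch does not check.

```python
import os
import signal


def request_core_dump(pid: int) -> None:
    """Send SIGABRT (signal 6 on Linux) to the given process.

    The default action for SIGABRT is to terminate the process and
    write a core file, if the core size limit allows it.
    """
    os.kill(pid, signal.SIGABRT)
```

This is equivalent to running kill -6 <PID> from a shell; the caller needs permission to signal the target process.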
===========================================================================
Date mail was appended: Thu Mar 12 10:58:11 2009 (1236873491)
Subject: Actions
Assigned to tannenba by roy
===========================================================================
Date of actions: Thu Mar 12 12:33:25 2009 (1236879206)
From: Alain Roy <roy__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19097] problems with schedd
Date: Thu, 12 Mar 2009 12:36:43 -0500
Hi,
I've talked it over with people, and I am giving this ticket to Todd
Tannenbaum to handle. I've also added it to the list of LIGO tickets
so it is tracked properly.
http://www.cs.wisc.edu/condor/ligo-tickets/
Thanks,
-alain
===========================================================================
Date mail was appended: Thu Mar 12 12:36:49 2009 (1236879409)
Date: Thu, 12 Mar 2009 20:55:44 +0100
From: Henning Fehrmann <henning.fehrmann__AT__aei.mpg.de>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: carsten.aulbert__AT__aei.mpg.de
Subject: Re: [condor-admin #19097] problems with schedd
Hi Alain,
> Hi,
>
> I've talked it over with people, and I am giving this ticket to Todd
> Tannenbaum to handle. I've also added it to the list of LIGO tickets
> so it is tracked properly.
>
> http://www.cs.wisc.edu/condor/ligo-tickets/
Thank you. Please let us know if you need further information.
Cheers,
Henning
===========================================================================
Date mail was appended: Thu Mar 12 14:56:06 2009 (1236887767)
Subject: Comments added
Carsten thinks we should revisit this once the site has upgraded to the latest
stable version and has also updated to Lenny.
Comments added by tannenba
===========================================================================
Date comments were added: Fri Jun 5 13:29:55 2009 (1244226595)
Subject: Actions
Ticket resolved by tannenba
===========================================================================
Date of actions: Fri Jun 19 13:28:13 2009 (1245436093)