LIGO Support Ticket 19180
Ticket Information
Number: admin 19180
User: carsten.aulbert@aei.mpg.de
Email:
Status: open
Assigned To: psilord
Date: Tue, 07 Apr 2009 15:03:41 +0200
From: Carsten Aulbert <carsten.aulbert__AT__aei.mpg.de>
To: condor-admin__AT__cs.wisc.edu
Subject: LIGO: schedd tries to connect to unresponsive node and stalls on read
X-Enigmail-Version: 0.95.7
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu
Hi all,
this might be the same bug Scott Koranda of UWM fame reported about a
month ago.
The schedd on one of our nodes stopped shortly after restarting it. The
target node it tries to connect to was unresponsive and after doing name
resolution it tried to connect (which was seemingly successful), however
the subsequent read on the handle blocked (sorry that I don't have the
strace log, I only had 1000 lines line buffer). From what I gathered, it
parsed the job queue from the spool directlory and tried to connect to a
service on the remote host (checkpointing server?).
Sorry that I don't have more right now, the log files were telling me
nothing at that time...
Cheers
Carsten
--
Dr. Carsten Aulbert - Max Planck Institute for Gravitational Physics
Callinstrasse 38, 30167 Hannover, Germany
Phone/Fax: +49 511 762-17185 / -17193
http://www.top500.org/system/9234 | http://www.top500.org/connfam/6/list/31
===========================================================================
Date of creation: Tue Apr 7 8:04:00 2009 (1239109443)
Subject: Actions
Assigned to tannenba by wenger
===========================================================================
Date of actions: Tue Apr 7 13:39:40 2009 (1239129581)
Date: Wed, 08 Apr 2009 08:29:32 +0200
From: Carsten Aulbert <carsten.aulbert__AT__aei.mpg.de>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19180] LIGO: schedd tries to connect to
unresponsive node and stalls on read
X-Enigmail-Version: 0.95.7
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu
Hi again,
schedd stopped again, here is a strace:
08:21:35.854859 write(11, "103 11055075.0 MaxHosts 1\n", 26) = 26
08:21:35.854902 fsync(11) = 0
08:21:35.855069 write(11, "104 11055075.0 RemotePool\n", 26) = 26
08:21:35.855111 fsync(11) = 0
08:21:35.855298 write(11, "103 11054428.0 CurrentHosts 0\n", 30) = 30
08:21:35.855340 fsync(11) = 0
08:21:35.855508 write(11, "104 11054428.0 ShadowBday\n", 26) = 26
08:21:35.855549 fsync(11) = 0
08:21:35.855740 write(11, "103 11054428.0 JobStatus 1\n", 27) = 27
08:21:35.855781 fsync(11) = 0
08:21:35.855949 write(11, "103 11054428.0 EnteredCurrentSta"..., 47) = 47
08:21:35.855990 fsync(11) = 0
08:21:35.856172 write(11, "103 11054428.0 LastSuspensionTim"..., 36) = 36
08:21:35.856214 fsync(11) = 0
08:21:35.856381 write(11, "103 11054428.0 MaxHosts 1\n", 26) = 26
08:21:35.856421 fsync(11) = 0
08:21:35.856593 write(11, "104 11054428.0 RemotePool\n", 26) = 26
08:21:35.856635 fsync(11) = 0
08:21:35.857139 stat("/local/condor.h2/spool/cluster11055235.proc0.subproc0", 0x9326828) = -1 ENOENT (No such file or directory)
08:21:35.857199 unlink("/local/condor.h2/spool/cluster11055235.proc0.subproc0") = -1 ENOENT (No such file or directory)
08:21:35.857265 open("/etc/hosts", O_RDONLY) = 12
08:21:35.857312 fcntl(12, F_GETFD) = 0
08:21:35.857348 fcntl(12, F_SETFD, FD_CLOEXEC) = 0
08:21:35.857382 fstat(12, {st_mode=S_IFREG|0644, st_size=202589, ...}) = 0
08:21:35.857436 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f22aa857000
08:21:35.857477 read(12, "################################"..., 4096) = 4096
08:21:35.857556 read(12, ".10.1.14 n0114.atlas.local n0114"..., 4096) = 4096
08:21:35.857629 read(12, "231.atlas.local n0231\n10.10.2.32"..., 4096) = 4096
08:21:35.857702 read(12, "ocal n0348\n10.10.3.49 n0349.atla"..., 4096) = 4096
08:21:35.857775 read(12, "10.10.4.66 n0466.atlas.local n04"..., 4096) = 4096
08:21:35.857848 read(12, "n0583.atlas.local n0583\n10.10.5."..., 4096) = 4096
08:21:35.857920 read(12, "local n0700\n10.10.7.1 n0701.atla"..., 4096) = 4096
08:21:35.857994 read(12, "18 n0818.atlas.local n0818\n10.10"..., 4096) = 4096
08:21:35.858066 read(12, "las.local n0935\n10.10.9.36 n0936"..., 4096) = 4096
08:21:35.858141 read(12, "051.atlas.local n1051\n10.10.10.5"..., 4096) = 4096
08:21:35.858219 read(12, "5.atlas.local n1165\n10.10.11.66 "..., 4096) = 4096
08:21:35.858296 close(12) = 0
08:21:35.858330 munmap(0x7f22aa857000, 4096) = 0
08:21:35.858373 socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 12
08:21:35.858430 bind(12, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
08:21:35.858488 connect(12, {sa_family=AF_INET, sin_port=htons(5653), sin_addr=inet_addr("10.10.12.72")}, 16) = 0
08:21:35.858755 setsockopt(12, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
08:21:35.858816 getpid() = 8233
08:21:35.858858 write(12, "a\2243K\0\0\0\0\0\300\0\0\0\0\0\0\0\0 )\0\0\0\0vaulin__AT__"..., 592) = 592
08:21:35.858917 read(12,
Here schedd blocks again - unfortunately the node is stuck (I cannot even log into the node via
IPMI and serial-over LAN). After I power-cycled the machine the connection was cut and schedd continued to run.
Does this help?
Cheers
Carsten
===========================================================================
Date mail was appended: Wed Apr 8 1:29:46 2009 (1239172187)
Subject: Actions
Assigned to psilord by psilord
===========================================================================
Date of actions: Fri Jun 12 13:26:55 2009 (1244831215)
Date: Thu, 16 Jul 2009 13:32:25 -0500
From: Peter Keller <psilord__AT__cs.wisc.edu>
To: psilord <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #19180] LIGO: schedd tries to connect to
unresponsive node and stalls on read
Hello,
> From: Carsten Aulbert <carsten.aulbert__AT__aei.mpg.de>
> The schedd on one of our nodes stopped shortly after restarting it. The
> target node it tries to connect to was unresponsive and after doing name
> resolution it tried to connect (which was seemingly successful), however
> the subsequent read on the handle blocked (sorry that I don't have the
> strace log, I only had 1000 lines line buffer). From what I gathered, it
> parsed the job queue from the spool directlory and tried to connect to a
> service on the remote host (checkpointing server?).
The problem of the schedd blocking while talking to a checkpoint server
is fixed and committed into the stable series. It will appear in 7.2.5.
Please read the version history carefully when it comes out, since it
describes new configuration file parameters which control the feature on
both the schedd and the ckpt server side.
The schedd side's configuration parameters control how long the schedd waits
in a connect to the ckpt server before putting a moratorium on connecting
to it, and the second is how long the moratorium period lasts.
The ckpt server's side of the bugfix is a new feature in that the ckpt
server automatically cleans up stale checkpoints, defined by the file
not being accessed in a configurable amount of seconds. How often the
checkpoint server performs this check is also configurable.
Reasonable defaults have been chosen for all of these new knobs.
The biggest thing to be aware of is that the removal of the stale
checkpoint feature is _on by default_ and it will check and remove stale
files immediately upon startup. Everything is recorded into the log file,
in case something bad happens.
The reason why the new feature had to be implemented was because while
the schedd can leak checkpoints into a down checkpoint server at this
time, the rate of leakage is pretty small. However, with my new bug fix,
the rate may be orders of magnitude faster--so a solution was needed
out of the box.
Hopefully, this will solve the problem for you.
Have a nice day.
-pete
===========================================================================
Date mail was appended: Thu Jul 16 13:32:31 2009 (1247769151)
From: Carsten Aulbert <carsten.aulbert__AT__aei.mpg.de>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19180] LIGO: schedd tries to connect to
unresponsive node and stalls on read
Date: Wed, 22 Jul 2009 21:30:54 +0200
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu
On Thursday 16 July 2009 20:32:31 condor-admin response tracking system wrote:
> Hopefully, this will solve the problem for you.
Thanks a lot, we will try it out soon :)
Cheers
Carsten
===========================================================================
Date mail was appended: Wed Jul 22 14:30:56 2009 (1248291057)