Ticket Information Number: support 1678 User: anderson@ligo.caltech.edu Email: espinoza_e__AT__ligo.caltech.edu,patrick__AT__gravity.phys.uwm.edu,lazz__AT__ligo.caltech.edu,espinoza__AT__ligo.caltech.edu,duncan__AT__gravity.phys.uwm.edu Status: resolved Assigned To: gthain
Date: Sat, 9 Sep 2006 15:12:42 -0700 From: Stuart Anderson <anderson__AT__ligo.caltech.edu> To: condor-support__AT__cs.wisc.edu, tannenba__AT__cs.wisc.edu Subject: LIGO condor-6.8.1 schedd exit status 4

A pre-release of condor-6.8.1 recently provided by Alan for running on the LIGO CIT cluster just had the schedd exit with status 4 and the following message. Note this occurred 6 minutes after the segfault reported in [condor-support #1677] and is similar to the spawnJobs errors reported in [condor-support #1652], which also occurred ~6 minutes after a schedd crash. Note there does not appear to have been a network failure at this time, but the load on the head-node was at least 50 (normally just a few) during the schedd restart and may be the cause of the failure-to-reconnect messages.

It appears that a restart of schedd with a busy queue (~1000 jobs) which has a mix of Standard/Vanilla and parallel (Dedicated Scheduler) jobs in it is definitely not safe.

"/ldcg/condor/sbin/condor_schedd" on "ldas-grid.ligo.caltech.edu" exited with status 4. Condor will automatically restart this process in 10 seconds.

*** Last 20 line(s) of file SchedLog:
9/9 13:26:33 (pid:20456) Dedicated Scheduler:: couldn't find machine vm1__AT__node246.ldas-cit.ligo.caltech.edu to reconnect to
9/9 13:26:33 (pid:20456) Dedicated Scheduler:: couldn't find machine vm4__AT__node165.ldas-cit.ligo.caltech.edu to reconnect to
9/9 13:26:33 (pid:20456) Dedicated Scheduler:: couldn't find machine vm1__AT__node217.ldas-cit.ligo.caltech.edu to reconnect to
9/9 13:26:33 (pid:20456) Dedicated Scheduler:: couldn't find machine vm2__AT__node209.ldas-cit.ligo.caltech.edu to reconnect to
9/9 13:26:33 (pid:20456) Dedicated Scheduler:: couldn't find machine vm1__AT__node20.ldas-cit.ligo.caltech.edu to reconnect to
9/9 13:26:33 (pid:20456) Dedicated Scheduler:: couldn't find machine vm3__AT__node253.ldas-cit.ligo.caltech.edu to reconnect to
9/9 13:26:33 (pid:20456) Dedicated Scheduler:: couldn't find machine vm2__AT__node253.ldas-cit.ligo.caltech.edu to reconnect to
9/9 13:26:33 (pid:20456) Dedicated Scheduler:: couldn't find machine vm2__AT__node258.ldas-cit.ligo.caltech.edu to reconnect to
9/9 13:26:33 (pid:20456) Dedicated Scheduler:: couldn't find machine vm3__AT__node115.ldas-cit.ligo.caltech.edu to reconnect to
9/9 13:26:33 (pid:20456) Dedicated Scheduler:: couldn't find machine vm2__AT__node119.ldas-cit.ligo.caltech.edu to reconnect to
9/9 13:26:33 (pid:20456) Dedicated Scheduler:: couldn't find machine vm4__AT__node271.ldas-cit.ligo.caltech.edu to reconnect to
9/9 13:26:33 (pid:20456) Dedicated Scheduler:: couldn't find machine vm2__AT__node189.ldas-cit.ligo.caltech.edu to reconnect to
9/9 13:26:33 (pid:20456) Dedicated Scheduler:: couldn't find machine vm1__AT__node178.ldas-cit.ligo.caltech.edu to reconnect to
9/9 13:26:33 (pid:20456) Dedicated Scheduler:: couldn't find machine vm2__AT__node222.ldas-cit.ligo.caltech.edu to reconnect to
9/9 13:26:33 (pid:20456) Dedicated Scheduler:: couldn't find machine vm1__AT__node56.ldas-cit.ligo.caltech.edu to reconnect to
9/9 13:26:33 (pid:20456) Dedicated Scheduler:: couldn't find machine vm2__AT__node56.ldas-cit.ligo.caltech.edu to reconnect to
9/9 13:26:33 (pid:20456) Dedicated Scheduler:: couldn't find machine vm4__AT__node164.ldas-cit.ligo.caltech.edu to reconnect to
9/9 13:26:33 (pid:20456) Dedicated Scheduler:: couldn't find machine vm3__AT__node239.ldas-cit.ligo.caltech.edu to reconnect to
9/9 13:26:33 (pid:20456) Dedicated Scheduler:: couldn't find machine vm4__AT__node159.ldas-cit.ligo.caltech.edu to reconnect to
9/9 13:26:33 (pid:20456) ERROR "spawnJobs(): allocation node has no matches!" at line 2351 in file dedicated_scheduler.C
*** End of file SchedLog

-- Stuart Anderson anderson__AT__ligo.caltech.edu http://www.ligo.caltech.edu/~anderson
=========================================================================== Date of creation: Sat Sep 9 17:18:03 2006 (1157840289)
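For anyone triaging a similar SchedLog, the reconnect failures quoted above can be tallied per host to see whether a few nodes or the whole pool is affected. The following is only a sketch: the function name `tally_reconnect_failures` is hypothetical, and the pattern is derived solely from the log lines quoted in this ticket (including the archive's `__AT__` rendering of `@`), not from any Condor tool.

```python
import re
from collections import Counter

# Pattern taken from the SchedLog lines quoted in this ticket. "__AT__" is how
# this archive renders "@"; a real log would contain "@", so both are accepted.
RECONNECT_FAIL = re.compile(
    r"Dedicated\s+Scheduler:: couldn't find machine "
    r"(\S+?)(?:__AT__|@)(\S+) to reconnect to"
)

def tally_reconnect_failures(log_text):
    """Return a Counter mapping host name -> number of slots that failed to reconnect."""
    hosts = Counter()
    for match in RECONNECT_FAIL.finditer(log_text):
        _slot, host = match.groups()
        hosts[host] += 1
    return hosts
```

Calling `tally_reconnect_failures(open("SchedLog").read()).most_common()` would list the hosts with the most affected slots first; the file name here is just an example path.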
Subject: Actions Assigned to adesmet by adesmet =========================================================================== Date of actions: Mon Sep 11 15:09:27 2006 (1158005367)
Subject: Actions Assigned to gthain by adesmet =========================================================================== Date of actions: Mon Sep 11 15:12:34 2006 (1158005554)
Date: Mon, 11 Sep 2006 16:57:34 -0700 From: Stuart Anderson <anderson__AT__ligo.caltech.edu> To: condor-support__AT__cs.wisc.edu, gthain__AT__cs.wisc.edu, Erik Espinoza <espinoza_e__AT__ligo.caltech.edu> Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit status 4

We now have a second schedd running that is meant to just submit parallel universe jobs, and both schedds are now running D_FULLDEBUG, which is generating ~1GB per hour of log files.

One interesting thing I have noticed is that even though ldas-grid is no longer running any parallel universe jobs, the following log file section indicates that it is trying to process one.

9/11 16:43:14 Saving classad to history file
9/11 16:43:14 SelfDrainingQueue job_is_finished_queue is empty, not resetting timer
9/11 16:43:14 Canceling timer for SelfDrainingQueue job_is_finished_queue (timer id: 298564)
9/11 16:43:14 In DedicatedScheduler::checkReconnectQueue
9/11 16:43:14 In checkReconnectQueue(), job: 191697400.159333944
9/11 16:43:14 Job 191697400.159333944 missing from queue?
9/11 16:43:14 Trying to query collector <10.14.0.12:9618>
9/11 16:43:15 DedicatedScheduler found machine vm1__AT__node1.ldas-cit.ligo.caltech.edu for possibly reconnection for job (191697400.159333944)
9/11 16:43:15 DedicatedScheduler found machine vm3__AT__node4.ldas-cit.ligo.caltech.edu for possibly reconnection for job (191697400.159333944)
9/11 16:43:15 DedicatedScheduler found machine vm1__AT__node5.ldas-cit.ligo.caltech.edu for possibly reconnection for job (191697400.159333944)

It is perhaps important to note that current valid job ids are somewhere around 7051962.0, not 191M and certainly not .159333944. This is in the running schedd on ldas-grid.

Is there anything useful we can extract from the running ReconnectQueue to figure out what it is confused about?

Thanks.
-- Stuart Anderson anderson__AT__ligo.caltech.edu http://www.ligo.caltech.edu/~anderson =========================================================================== Date mail was appended: Mon Sep 11 18:58:07 2006 (1158019088)
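The observation above (a job id of 191697400.159333944 when valid ids are near 7051962.0) suggests a simple sanity check that could be run over a D_FULLDEBUG log. This is a sketch only: `suspicious_job_ids` is a hypothetical helper, the pattern is taken from the single `checkReconnectQueue()` line quoted in this message, and the bounds are assumptions the caller has to supply from knowledge of the local queue.

```python
import re

# Matches the "job: <cluster>.<proc>" field in the checkReconnectQueue() log
# line quoted in this ticket; other log formats are not covered.
JOB_ID = re.compile(r"checkReconnectQueue\(\), job: (\d+)\.(\d+)")

def suspicious_job_ids(log_text, max_cluster, max_proc):
    """Return (cluster, proc) pairs whose values exceed the caller's expected bounds."""
    flagged = []
    for match in JOB_ID.finditer(log_text):
        cluster, proc = int(match.group(1)), int(match.group(2))
        if cluster > max_cluster or proc > max_proc:
            flagged.append((cluster, proc))
    return flagged
```

With bounds such as `max_cluster=8_000_000` and `max_proc=100_000` (illustrative values, not taken from the ticket), the id reported above would be flagged while an id like 7051962.0 would not.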
Date: Wed, 13 Sep 2006 10:13:51 -0700 From: Stuart Anderson <anderson__AT__ligo.caltech.edu> To: condor-support__AT__cs.wisc.edu, miron__AT__cs.wisc.edu, tannenba__AT__cs.wisc.edu, zmiller__AT__cs.wisc.edu, adesmet__AT__cs.wisc.edu, gthain__AT__cs.wisc.edu, wenger__AT__cs.wisc.edu CC: Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>, Patrick Brady <patrick__AT__gravity.phys.uwm.edu>, Albert Lazzarini <lazz__AT__ligo.caltech.edu> Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit status 4

It happened again this morning, i.e.,

9/13 08:03:50 Create_Thread: ERROR: we've had 101 consecutive pid collisions, giving up! (6270 PIDs being tracked internally.)
9/13 08:03:50 ERROR "Assertion ERROR on (tid != 0)" at line 115 in file datathread.C

Fortunately, I was able to get to the Schedd log files before they rolled over. Please see the readme file for further comments and my initial observations on the crash relating to the dedicated scheduler and duplicate rescue DAGs at
http://www.ligo.caltech.edu/~anderson/condor.1678/readme

The detailed log files may be found in the same directory.

Thanks.

-- Stuart Anderson anderson__AT__ligo.caltech.edu http://www.ligo.caltech.edu/~anderson
=========================================================================== Date mail was appended: Wed Sep 13 12:16:02 2006 (1158167763)
Date: Wed, 13 Sep 2006 12:48:40 -0500 To: Stuart Anderson <anderson__AT__ligo.caltech.edu>, condor-support__AT__cs.wisc.edu, tannenba__AT__cs.wisc.edu, zmiller__AT__cs.wisc.edu, adesmet__AT__cs.wisc.edu, gthain__AT__cs.wisc.edu, wenger__AT__cs.wisc.edu From: Miron Livny <miron__AT__cs.wisc.edu> Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit status 4 CC: Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>, Patrick Brady <patrick__AT__gravity.phys.uwm.edu>, Albert Lazzarini <lazz__AT__ligo.caltech.edu> Who is looking into this on the Condor side? Miron At 12:13 PM 9/13/2006, Stuart Anderson wrote: >It happened again this morning, i.e., > >9/13 08:03:50 Create_Thread: ERROR: we've had 101 consecutive pid collisions, >giving up! (6270 PIDs being tracked internally.) >9/13 08:03:50 ERROR "Assertion ERROR on (tid != 0)" at line 115 in file >datathread.C > >Fortunately, I was able to get to the Schedd log files before they >rolled over. Please see the readme file for further comments and my >initial observations on the crash relating to the dedicated scheduler >and duplicate rescue dagas at, >http://www.ligo.caltech.edu/~anderson/condor.1678/readme > >The detailed log files may be found in the same directory. > > >Thanks. > >-- >Stuart Anderson anderson__AT__ligo.caltech.edu >http://www.ligo.caltech.edu/~anderson =========================================================================== Date mail was appended: Wed Sep 13 12:51:38 2006 (1158169898)
Date: Wed, 13 Sep 2006 13:00:34 -0500 From: Greg Thain <gthain__AT__cs.wisc.edu> To: Miron Livny <miron__AT__cs.wisc.edu> CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>, condor-support__AT__cs.wisc.edu, tannenba__AT__cs.wisc.edu, zmiller__AT__cs.wisc.edu, adesmet__AT__cs.wisc.edu, wenger__AT__cs.wisc.edu, Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>, Patrick Brady <patrick__AT__gravity.phys.uwm.edu>, Albert Lazzarini <lazz__AT__ligo.caltech.edu> Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit status 4 Miron Livny wrote: > Who is looking into this on the Condor side? Miron I am. -Greg =========================================================================== Date mail was appended: Wed Sep 13 13:00:58 2006 (1158170459)
Date: Wed, 13 Sep 2006 16:38:14 -0500 From: Greg Thain <gthain__AT__cs.wisc.edu> To: Stuart Anderson <anderson__AT__ligo.caltech.edu> CC: condor-support__AT__cs.wisc.edu, miron__AT__cs.wisc.edu, tannenba__AT__cs.wisc.edu, zmiller__AT__cs.wisc.edu, adesmet__AT__cs.wisc.edu, wenger__AT__cs.wisc.edu, Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>, Patrick Brady <patrick__AT__gravity.phys.uwm.edu>, Albert Lazzarini <lazz__AT__ligo.caltech.edu> Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit status 4 Stuart Anderson wrote: > It happened again this morning, i.e., > > 9/13 08:03:50 Create_Thread: ERROR: we've had 101 consecutive pid collisions, > giving up! (6270 PIDs being tracked internally.) > 9/13 08:03:50 ERROR "Assertion ERROR on (tid != 0)" at line 115 in file > datathread.C Stuart: The logs have been very helpful in identifying what is going on here. While we are working on a code fix, I think that upping the parameter MAX_PID_COLLISION_RETRY again to 1000 should be helpful. Thanks, and I'll be updating you with more information, -Greg =========================================================================== Date mail was appended: Wed Sep 13 16:41:38 2006 (1158183700)
Date: Wed, 13 Sep 2006 15:42:39 -0700 From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu> To: condor-support__AT__cs.wisc.edu CC: anderson__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu, patrick__AT__gravity.phys.uwm.edu, lazz__AT__ligo.caltech.edu Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit status 4 X-Enigmail-Version: 0.94.0.0 Openpgp: url=http://pgp.mit.edu/ X-No-Archive: Yes X-Archive: No The MAX_PID_COLLISION_RETRY has been raised to 1,000. Thanks, Erik condor-support response tracking system wrote: > Stuart Anderson wrote: >> It happened again this morning, i.e., >> >> 9/13 08:03:50 Create_Thread: ERROR: we've had 101 consecutive pid collisions, >> giving up! (6270 PIDs being tracked internally.) >> 9/13 08:03:50 ERROR "Assertion ERROR on (tid != 0)" at line 115 in file >> datathread.C > > Stuart: > > The logs have been very helpful in identifying what is going on here. > While we are working on a code fix, I think that upping the parameter > > MAX_PID_COLLISION_RETRY > > again to 1000 should be helpful. > > Thanks, and I'll be updating you with more information, > > -Greg > > > ======================================== > MESSAGE INFORMATION > ======================================== > * From: Greg Thain <gthain__AT__cs.wisc.edu> > * Ticket Email List: anderson__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu,patrick__AT__gravity.phys.uwm.edu,lazz__AT__ligo.caltech.edu > -- Erik A. Espinoza Systems Administrator LIGO/Caltech - MS 18-34 Pasadena, CA 91125 Ph: 626-395-8517 =========================================================================== Date mail was appended: Wed Sep 13 17:43:06 2006 (1158187387)
Date: Wed, 13 Sep 2006 20:39:04 -0700 From: Stuart Anderson <anderson__AT__ligo.caltech.edu> To: Greg Thain <gthain__AT__cs.wisc.edu> CC: condor-support__AT__cs.wisc.edu, miron__AT__cs.wisc.edu, tannenba__AT__cs.wisc.edu, zmiller__AT__cs.wisc.edu, adesmet__AT__cs.wisc.edu, wenger__AT__cs.wisc.edu, Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>, Patrick Brady <patrick__AT__gravity.phys.uwm.edu>, Albert Lazzarini <lazz__AT__ligo.caltech.edu> Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit status 4 On Wed, Sep 13, 2006 at 04:38:14PM -0500, Greg Thain wrote: > Stuart Anderson wrote: > >It happened again this morning, i.e., > > > >9/13 08:03:50 Create_Thread: ERROR: we've had 101 consecutive pid > >collisions, > >giving up! (6270 PIDs being tracked internally.) > >9/13 08:03:50 ERROR "Assertion ERROR on (tid != 0)" at line 115 in file > >datathread.C > > Stuart: > > The logs have been very helpful in identifying what is going on here. > While we are working on a code fix, I think that upping the parameter > > MAX_PID_COLLISION_RETRY > > again to 1000 should be helpful. Are you sure? As Erik indicated earlier we have made this change, however, in a recent discussion Alan indicated this retry mechanism was implemented via recursion so there was some concern about recursing too deeply. > > Thanks, and I'll be updating you with more information, I look forward to that. Thanks. -- Stuart Anderson anderson__AT__ligo.caltech.edu http://www.ligo.caltech.edu/~anderson =========================================================================== Date mail was appended: Wed Sep 13 22:40:58 2006 (1158205259)
Date: Thu, 14 Sep 2006 09:40:51 -0500 From: Greg Thain <gthain__AT__cs.wisc.edu> To: condor-support__AT__cs.wisc.edu Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit status 4 > Are you sure? As Erik indicated earlier we have made this change, however, > in a recent discussion Alan indicated this retry mechanism was implemented > via recursion so there was some concern about recursing too deeply. Yes, although there is some concern about deep recursion, I think it will be ok, and even if it isn't, it is no worse than the alternative. We are discussing code fixes now, and hope to be able to update you soon. -Greg =========================================================================== Date mail was appended: Thu Sep 14 9:40:57 2006 (1158244858)
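The recursion concern discussed above can be illustrated with a small sketch. This is not Condor's code: `spawn_with_retry`, its parameters, and the use of a Python set for the tracked PIDs are all hypothetical. It only shows that a retry-on-collision loop written iteratively keeps the call stack flat, so a larger retry limit (such as the 1000 suggested for MAX_PID_COLLISION_RETRY) would not deepen recursion.

```python
def spawn_with_retry(spawn, tracked_pids, max_retry=1000):
    """Call spawn() until it returns a PID that is not already tracked.

    spawn        -- hypothetical callable returning a new PID (stands in for
                    whatever actually creates the thread or process)
    tracked_pids -- set of PIDs currently tracked; updated in place
    max_retry    -- number of attempts before giving up (assumed semantics)

    The loop is iterative, so raising max_retry does not add stack frames.
    """
    for _ in range(max_retry):
        pid = spawn()
        if pid not in tracked_pids:
            tracked_pids.add(pid)
            return pid
    raise RuntimeError(f"{max_retry} consecutive pid collisions, giving up")
```

A real implementation would also have to decide what to do with the colliding child before retrying; that step is omitted here.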
Date: Thu, 14 Sep 2006 11:03:22 -0500 (CDT) From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu> To: Stuart Anderson <anderson__AT__ligo.caltech.edu> CC: condor-support__AT__cs.wisc.edu, miron__AT__cs.wisc.edu, tannenba__AT__cs.wisc.edu, zmiller__AT__cs.wisc.edu, adesmet__AT__cs.wisc.edu, gthain__AT__cs.wisc.edu, Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>, Patrick Brady <patrick__AT__gravity.phys.uwm.edu>, Albert Lazzarini <lazz__AT__ligo.caltech.edu>, "R. Kent Wenger" <wenger__AT__cs.wisc.edu> Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit status 4 Stuart, > 9/13 08:03:50 Create_Thread: ERROR: we've had 101 consecutive pid collisions, > giving up! (6270 PIDs being tracked internally.) > 9/13 08:03:50 ERROR "Assertion ERROR on (tid != 0)" at line 115 in file > datathread.C > > Fortunately, I was able to get to the Schedd log files before they > rolled over. Please see the readme file for further comments and my > initial observations on the crash relating to the dedicated scheduler > and duplicate rescue dagas at, > http://www.ligo.caltech.edu/~anderson/condor.1678/readme > > The detailed log files may be found in the same directory. Do you have the relevant dagman.out file anywhere? Do you know if the "original" (pre-schedd-crash) DAGMan was started after "DAGMAN_ABORT_DUPLICATES = True" was added to the configuration? Kent Wenger Condor Team =========================================================================== Date mail was appended: Thu Sep 14 11:07:35 2006 (1158250056)
Date: Fri, 15 Sep 2006 18:05:55 -0700 From: Stuart Anderson <anderson__AT__ligo.caltech.edu> To: "R. Kent Wenger" <wenger__AT__cs.wisc.edu> CC: condor-support__AT__cs.wisc.edu, Erik Espinoza <espinoza_e__AT__ligo.caltech.edu> Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit status 4

On Thu, Sep 14, 2006 at 11:03:22AM -0500, R. Kent Wenger wrote:
> Stuart,
>
> > 9/13 08:03:50 Create_Thread: ERROR: we've had 101 consecutive pid collisions,
> > giving up! (6270 PIDs being tracked internally.)
> > 9/13 08:03:50 ERROR "Assertion ERROR on (tid != 0)" at line 115 in file
> > datathread.C
> >
> > Fortunately, I was able to get to the Schedd log files before they
> > rolled over. Please see the readme file for further comments and my
> > initial observations on the crash relating to the dedicated scheduler
> > and duplicate rescue dagas at,
> > http://www.ligo.caltech.edu/~anderson/condor.1678/readme
> >
> > The detailed log files may be found in the same directory.
>
> Do you have the relevant dagman.out file anywhere?
>
> Do you know if the "original" (pre-schedd-crash) DAGMan was started after
> "DAGMAN_ABORT_DUPLICATES = True" was added to the configuration?
>

Erik, please correct me if I am wrong, but I believe we determined that these DAGMans were started after the configuration change; however, you were unable to find the dagman.out files.

Thanks.

-- Stuart Anderson anderson__AT__ligo.caltech.edu http://www.ligo.caltech.edu/~anderson
=========================================================================== Date mail was appended: Fri Sep 15 20:14:06 2006 (1158369247)
Date: Sun, 17 Sep 2006 22:52:10 -0500 To: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>, Stuart Anderson <anderson__AT__ligo.caltech.edu> From: Miron Livny <miron__AT__cs.wisc.edu> Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit status 4 CC: condor-support__AT__cs.wisc.edu, tannenba__AT__cs.wisc.edu, zmiller__AT__cs.wisc.edu, adesmet__AT__cs.wisc.edu, gthain__AT__cs.wisc.edu, Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>, Patrick Brady <patrick__AT__gravity.phys.uwm.edu>, Albert Lazzarini <lazz__AT__ligo.caltech.edu>, "R. Kent Wenger" <wenger__AT__cs.wisc.edu> Greg and Kent, Any news or progress? Miron At 11:03 AM 9/14/2006, R. Kent Wenger wrote: >Stuart, > > > 9/13 08:03:50 Create_Thread: ERROR: we've had 101 consecutive pid > collisions, > > giving up! (6270 PIDs being tracked internally.) > > 9/13 08:03:50 ERROR "Assertion ERROR on (tid != 0)" at line 115 in file > > datathread.C > > > > Fortunately, I was able to get to the Schedd log files before they > > rolled over. Please see the readme file for further comments and my > > initial observations on the crash relating to the dedicated scheduler > > and duplicate rescue dagas at, > > http://www.ligo.caltech.edu/~anderson/condor.1678/readme > > > > The detailed log files may be found in the same directory. > >Do you have the relevant dagman.out file anywhere? > >Do you know if the "original" (pre-schedd-crash) DAGMan was started after >"DAGMAN_ABORT_DUPLICATES = True" was added to the configuration? > >Kent Wenger >Condor Team =========================================================================== Date mail was appended: Sun Sep 17 22:53:49 2006 (1158551630)
Date: Mon, 18 Sep 2006 08:12:25 -0500 From: Greg Thain <gthain__AT__cs.wisc.edu> To: Miron Livny <miron__AT__cs.wisc.edu> CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>, Stuart Anderson <anderson__AT__ligo.caltech.edu>, condor-support__AT__cs.wisc.edu, tannenba__AT__cs.wisc.edu, zmiller__AT__cs.wisc.edu, adesmet__AT__cs.wisc.edu, Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>, Patrick Brady <patrick__AT__gravity.phys.uwm.edu>, Albert Lazzarini <lazz__AT__ligo.caltech.edu> Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit status 4 Miron Livny wrote: > Greg and Kent, > > Any news or progress? > > Miron On the PID collision front, we understand why it happens, and we are investigating several possible solutions. In the short term, LIGO has implemented a configuration change which we believe will lessen the chance of hitting this problem. -Greg =========================================================================== Date mail was appended: Mon Sep 18 8:12:57 2006 (1158585178)
Date: Mon, 18 Sep 2006 15:31:42 -0500 (CDT) From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu> To: Miron Livny <miron__AT__cs.wisc.edu> CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>, condor-support__AT__cs.wisc.edu, tannenba__AT__cs.wisc.edu, zmiller__AT__cs.wisc.edu, adesmet__AT__cs.wisc.edu, gthain__AT__cs.wisc.edu, Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>, Patrick Brady <patrick__AT__gravity.phys.uwm.edu>, Albert Lazzarini <lazz__AT__ligo.caltech.edu>, "R. Kent Wenger" <wenger__AT__cs.wisc.edu> Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit status 4 Miron, > Any news or progress? I have an idea on the DAGMan end, but I can't confirm it without a dagman.out file. I may send them a pre-release DAGMan that will fix things if that is the problem. Kent =========================================================================== Date mail was appended: Mon Sep 18 15:35:05 2006 (1158611707)
Date: Mon, 18 Sep 2006 13:40:15 -0700 From: Stuart Anderson <anderson__AT__ligo.caltech.edu> To: "R. Kent Wenger" <wenger__AT__cs.wisc.edu> CC: Miron Livny <miron__AT__cs.wisc.edu>, condor-support__AT__cs.wisc.edu, tannenba__AT__cs.wisc.edu, zmiller__AT__cs.wisc.edu, adesmet__AT__cs.wisc.edu, gthain__AT__cs.wisc.edu, Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>, Patrick Brady <patrick__AT__gravity.phys.uwm.edu> Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit status 4 On Mon, Sep 18, 2006 at 03:31:42PM -0500, R. Kent Wenger wrote: > Miron, > > > Any news or progress? > > I have an idea on the DAGMan end, but I can't confirm it without a > dagman.out file. > > I may send them a pre-release DAGMan that will fix things if that is the > problem. Kent, Is this a possible fix for the root problem or a fix for the "DAGMAN_ABORT_DUPLICATES = True" patch? To avoid the problem of users cleaning up their own dagman.out files before we can find them for a problem report, would it be possible, and desirable, to add a DAGMAN debug option that keeps a more permanent copy of any interesting dagman related logging information in a similar fashion to the main condor daemons, e.g., DAGMAN_DEBUG = ...? Thanks. -- Stuart Anderson anderson__AT__ligo.caltech.edu http://www.ligo.caltech.edu/~anderson =========================================================================== Date mail was appended: Mon Sep 18 15:45:06 2006 (1158612307)
Date: Mon, 18 Sep 2006 17:37:47 -0500 (CDT) From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu> To: Stuart Anderson <anderson__AT__ligo.caltech.edu> CC: Miron Livny <miron__AT__cs.wisc.edu>, condor-support__AT__cs.wisc.edu, tannenba__AT__cs.wisc.edu, zmiller__AT__cs.wisc.edu, adesmet__AT__cs.wisc.edu, gthain__AT__cs.wisc.edu, Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>, Patrick Brady <patrick__AT__gravity.phys.uwm.edu>, "R. Kent Wenger" <wenger__AT__cs.wisc.edu> Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit status 4 Stuart, > On Mon, Sep 18, 2006 at 03:31:42PM -0500, R. Kent Wenger wrote: > > > > I may send them a pre-release DAGMan that will fix things if that is the > > problem. > > Is this a possible fix for the root problem or a fix for the > "DAGMAN_ABORT_DUPLICATES = True" patch? This would be a fix to the DAGMAN_ABORT_DUPLICATES feature. You need RedHat 9 binaries, right? > To avoid the problem of users cleaning up their own dagman.out > files before we can find them for a problem report, would it be possible, > and desirable, to add a DAGMAN debug option that keeps a more permanent > copy of any interesting dagman related logging information in a similar > fashion to the main condor daemons, e.g., DAGMAN_DEBUG = ...? Well, there's already a DAGMAN_DEBUG that controls how much info goes into the dagman.out file. The main change I can think of would be an option to give the dagman.out file a unique name -- maybe something like foobar.dag.dagman.out.<machine>.<cluster>. That way re-running the same DAG wouldn't overwrite an existing dagman.out file. Do you think something like that would solve the problem? My inclination is to not add something that essentially duplicates dagman.out. Kent Wenger Condor Team =========================================================================== Date mail was appended: Mon Sep 18 17:37:54 2006 (1158619075)
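The naming scheme Kent floats above, foobar.dag.dagman.out.<machine>.<cluster>, can be sketched as a one-line helper. This is an illustration of the proposal only, not an existing DAGMan option: the function name `unique_dagman_out` and its arguments are hypothetical, and the machine name and cluster id in the usage note are just example values.

```python
def unique_dagman_out(dag_file, machine, cluster):
    """Build a per-run dagman.out name following the pattern proposed in this
    thread: <dag file>.dagman.out.<machine>.<cluster>.

    Because the cluster id differs on every submission, re-running the same
    DAG would write a new file instead of overwriting the previous one.
    """
    return f"{dag_file}.dagman.out.{machine}.{cluster}"
```

For example, `unique_dagman_out("foobar.dag", "ldas-grid.ligo.caltech.edu", 7051962)` gives `foobar.dag.dagman.out.ldas-grid.ligo.caltech.edu.7051962`.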
Date: Mon, 18 Sep 2006 16:06:41 -0700 From: Stuart Anderson <anderson__AT__ligo.caltech.edu> To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>, Brown Duncan <duncan__AT__gravity.phys.uwm.edu> CC: Erik Espinoza <espinoza_e__AT__ligo.caltech.edu> Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit status 4 Duncan, Please read below. On Mon, Sep 18, 2006 at 05:37:54PM -0600, condor-support response tracking system wrote: > Stuart, > > > On Mon, Sep 18, 2006 at 03:31:42PM -0500, R. Kent Wenger wrote: > > > > > > I may send them a pre-release DAGMan that will fix things if that is the > > > problem. > > > > Is this a possible fix for the root problem or a fix for the > > "DAGMAN_ABORT_DUPLICATES = True" patch? > > This would be a fix to the DAGMAN_ABORT_DUPLICATES feature. > > You need RedHat 9 binaries, right? Kent, I am still confused about the last discussion on the re-basing of Condor Linux builds from RH9 to RHEL3, so whichever one of these you think is the "right" one to run on an FC4 x86_64 machine will be fine. > > > To avoid the problem of users cleaning up their own dagman.out > > files before we can find them for a problem report, would it be possible, > > and desirable, to add a DAGMAN debug option that keeps a more permanent > > copy of any interesting dagman related logging information in a similar > > fashion to the main condor daemons, e.g., DAGMAN_DEBUG = ...? > > Well, there's already a DAGMAN_DEBUG that controls how much info goes > into the dagman.out file. > > The main change I can think of would be an option to give the dagman.out > file a unique name -- maybe something like > foobar.dag.dagman.out.<machine>.<cluster>. That way re-running the same > DAG wouldn't overwrite an existing dagman.out file. > > Do you think something like that would solve the problem? My inclination > is to not add something that essentially duplicates dagman.out. 
> Kent, This sounds reasonable to me but I often get confused about all the files generated when running a DAG (and what they should be called). Duncan, What do you think? The goal is to come up with a scheme that makes it easier to provide Kent with the necessary debugging information in the future when we have problems with DAG's without confusing our users or breaking current analysis pipelines. Thanks. -- Stuart Anderson anderson__AT__ligo.caltech.edu http://www.ligo.caltech.edu/~anderson =========================================================================== Date mail was appended: Mon Sep 18 18:07:00 2006 (1158620820)
Date: Tue, 19 Sep 2006 17:17:41 -0700 From: Stuart Anderson <anderson__AT__ligo.caltech.edu> To: Greg Thain <gthain__AT__cs.wisc.edu> CC: Miron Livny <miron__AT__cs.wisc.edu>, "R. Kent Wenger" <wenger__AT__cs.wisc.edu>, condor-support__AT__cs.wisc.edu, tannenba__AT__cs.wisc.edu, zmiller__AT__cs.wisc.edu, adesmet__AT__cs.wisc.edu, Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>, Patrick Brady <patrick__AT__gravity.phys.uwm.edu> Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit status 4

Greg, Alan, others?

What bugs have been fixed in the 6.8.1 release today compared to the 6.8.1 pre-release Alan gave us that we are running at LIGO Caltech? That is, should we upgrade to 6.8.1 final or wait for more of these bug fixes that are being worked on? Any advice would be appreciated.

Thanks.

On Mon, Sep 18, 2006 at 08:12:25AM -0500, Greg Thain wrote:
> Miron Livny wrote:
> >Greg and Kent,
> >
> >Any news or progress?
> >
> >Miron
>
> On the PID collision front, we understand why it happens, and we are
> investigating several possible solutions. In the short term, LIGO has
> implemented a configuration change which we believe will lessen the
> chance of hitting this problem.
>
> -Greg

-- Stuart Anderson anderson__AT__ligo.caltech.edu http://www.ligo.caltech.edu/~anderson
=========================================================================== Date mail was appended: Tue Sep 19 19:19:05 2006 (1158711546)
Date: Wed, 20 Sep 2006 10:28:48 -0500 From: Alan De Smet <adesmet__AT__cs.wisc.edu> To: Stuart Anderson <anderson__AT__ligo.caltech.edu> CC: Greg Thain <gthain__AT__cs.wisc.edu>, Miron Livny <miron__AT__cs.wisc.edu>, "R. Kent Wenger" <wenger__AT__cs.wisc.edu>, condor-support__AT__cs.wisc.edu, tannenba__AT__cs.wisc.edu, zmiller__AT__cs.wisc.edu, Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>, Patrick Brady <patrick__AT__gravity.phys.uwm.edu> Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit status 4 Stuart Anderson <anderson__AT__ligo.caltech.edu> wrote: > What bugs have been fixed in the 6.8.1 release today compared to the > 6.8.1 pre-releaase Alan gave us that we are running at LIGO Caltech? Not a lot, although there might be a security fix or two. I'll get you more complete details and a recommendation (wait or upgrade) in a moment. -- Alan De Smet Condor Project Research adesmet__AT__cs.wisc.edu http://www.condorproject.org/ =========================================================================== Date mail was appended: Wed Sep 20 10:31:02 2006 (1158766263)
Date: Wed, 20 Sep 2006 10:46:34 -0500 From: Alan De Smet <adesmet__AT__cs.wisc.edu> To: Stuart Anderson <anderson__AT__ligo.caltech.edu> CC: Greg Thain <gthain__AT__cs.wisc.edu>, Miron Livny <miron__AT__cs.wisc.edu>, "R. Kent Wenger" <wenger__AT__cs.wisc.edu>, condor-support__AT__cs.wisc.edu, tannenba__AT__cs.wisc.edu, zmiller__AT__cs.wisc.edu, Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>, Patrick Brady <patrick__AT__gravity.phys.uwm.edu> Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit status 4

Stuart Anderson <anderson__AT__ligo.caltech.edu> wrote:
> What bugs have been fixed in the 6.8.1 release today compared to the
> 6.8.1 pre-releaase Alan gave us that we are running at LIGO Caltech?

A number of infrequent bugs have been fixed.

- Improvements to DAGMan's duplicate run detection.
- Fixes to some negotiator and schedd optimizations that could cause Condor to make incorrect matching decisions.
- Support for large STARTD_NAMEs.
- Fixes to CLAIMTOBE authorization to allow more precise filtering. Not enabled by default because of backward compatibility issues.
- Shadows are better at detecting that the associated starter is gone for good (say, because of a reboot), speeding up a number of things.
- File descriptor leak fixed.

> i.e., should we upgrade to 6.8.1 final or wait for more of these bug
> fixes that are being worked on? Any advice would be appreciated.

You should probably upgrade, but there is no rush.

You're not (to our knowledge) getting hit by most of the bugs we fixed. You are hitting the DAGMan problems, but Kent's giving you binaries that are even newer than 6.8.1. You might be hitting the negotiator/schedd optimization problems, but the impact is minor (jobs run in the "wrong" order).

We're working on a more permanent fix for your PID collision problems; I'm hoping to be able to offer you a new schedd in about a week. Given that replacing your schedd is very severe for your pool, you might put off the 6.8.1 upgrade until we can provide a new schedd binary. If the upgrade isn't a big deal, you can certainly upgrade sooner and you might get some minor benefits.

-- Alan De Smet Condor Project Research adesmet__AT__cs.wisc.edu http://www.condorproject.org/
=========================================================================== Date mail was appended: Wed Sep 20 10:51:20 2006 (1158767480)
Date: Wed, 20 Sep 2006 22:02:56 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza_e__AT__ligo.caltech.edu, patrick__AT__gravity.phys.uwm.edu,
espinoza__AT__ligo.caltech.edu, duncan__AT__gravity.phys.uwm.edu
Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit status 4
Alan,
Thanks for the details. We will stick with 6.8.1pre for now and wait for
the 6.8.1post schedd before upgrading. Is there any chance that this schedd
will also have fixes for the dedicatedScheduler problems?
Thanks.
On Wed, Sep 20, 2006 at 10:51:20AM -0600, condor-support response tracking system wrote:
> Stuart Anderson <anderson__AT__ligo.caltech.edu> wrote:
> > What bugs have been fixed in the 6.8.1 release today compared to the
> > 6.8.1 pre-release Alan gave us that we are running at LIGO Caltech?
>
> A number of infrequent bugs have been fixed.
>
> - Improvements to DAGMan's duplicate run detection.
>
> - Fixes to some negotiator and schedd optimizations that could
>   cause Condor to make incorrect matching decisions.
>
> - Support for large STARTD_NAMEs.
>
> - Fixes to CLAIMTOBE authorization to allow more precise
>   filtering. Not enabled by default because of backward
>   compatibility issues.
>
> - shadows are better at detecting that the associated starter is
>   gone for good (say, because of a reboot), speeding up a number
>   of things.
>
> - File descriptor leak fixed.
>
> > i.e., should we upgrade to 6.8.1 final or wait for more of these bug
> > fixes that are being worked on? Any advice would be appreciated.
>
> You should probably upgrade, but there is no rush.
>
> You're not (to our knowledge) getting hit by most of the bugs we
> fixed. You are hitting the DAGMan problems, but Kent's giving
> you binaries that are even newer than 6.8.1. You might be
> hitting the negotiator/schedd optimization problems, but the
> impact is minor (jobs run in the "wrong" order).
>
> We're working on a more permanent fix for your PID collision
> problems; I'm hoping to be able to offer you a new schedd in
> about a week. Given that replacing your schedd is very severe
> for your pool, you might put off the 6.8.1 upgrade until we can
> provide a new schedd binary. If the upgrade isn't a big deal,
> you can certainly upgrade sooner and you might get some minor
> benefits.
>
> --
> Alan De Smet Condor Project Research
> adesmet__AT__cs.wisc.edu http://www.condorproject.org/
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Thu Sep 21 0:03:15 2006 (1158814996)
Date: Thu, 21 Sep 2006 10:49:31 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: Stuart Anderson <anderson__AT__ligo.caltech.edu>
CC: Miron Livny <miron__AT__cs.wisc.edu>, condor-support__AT__cs.wisc.edu,
tannenba__AT__cs.wisc.edu, zmiller__AT__cs.wisc.edu, adesmet__AT__cs.wisc.edu,
gthain__AT__cs.wisc.edu, Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>,
Patrick Brady <patrick__AT__gravity.phys.uwm.edu>,
"R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit status 4
Stuart,
Okay, I have a 6.8.2 pre-release condor_dagman for you, which should fix
the problems with the "abort duplicate runs" feature.
You can get it at:
ftp://ftp.cs.wisc.edu/condor/temporary/forligo/condor_dagman
Note that in order for this condor_dagman to properly abort itself, the
earlier condor_dagman must also be the 6.8.2 pre-release version, not
6.8.1. In other words, to get the "no duplicates" protection on existing
runs, you'll have to condor_rm them and start the rescue DAGs with the new
condor_dagman binary.
All of the configuration macros are the same as for 6.8.1.
Please let me know if you have any questions.
Kent Wenger
Condor Team
===========================================================================
Date mail was appended: Thu Sep 21 10:50:53 2006 (1158853854)
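Editorial sketch of the procedure Kent describes above (remove the running DAG, then resubmit its rescue DAG so the new condor_dagman picks it up). This is a hedged illustration, not something posted to the ticket: the cluster ID 1234 and the file name analysis.dag are hypothetical, the rescue file name is assumed to be the DAG file name with ".rescue" appended, and the new condor_dagman binary is assumed to be installed already. The code only builds and prints the commands; it does not run them.

```python
from typing import List


def restart_commands(cluster_id: int, dag_file: str) -> List[List[str]]:
    """Build, in order, the commands that move one running DAG to the new condor_dagman.

    cluster_id: cluster ID of the running condor_dagman job (hypothetical value in the demo).
    dag_file:   DAG input file that was originally submitted (hypothetical value in the demo).
    """
    # Assumption: the rescue DAG is written next to the DAG file as "<dag_file>.rescue".
    rescue_dag = dag_file + ".rescue"
    return [
        ["condor_rm", str(cluster_id)],      # stop the DAG started by the old binary
        ["condor_submit_dag", rescue_dag],   # resubmit the rescue DAG with the new binary
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in restart_commands(1234, "analysis.dag"):
        print(" ".join(cmd))
```

Printing rather than executing keeps the sketch safe to try on a submit host; check that the rescue file actually exists before running the second command by hand.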
Date: Tue, 26 Sep 2006 11:39:52 -0700
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit status 4
CC: "Stuart Anderson" <anderson__AT__ligo.caltech.edu>,
"Miron Livny" <miron__AT__cs.wisc.edu>, condor-support__AT__cs.wisc.edu,
tannenba__AT__cs.wisc.edu, zmiller__AT__cs.wisc.edu, adesmet__AT__cs.wisc.edu,
gthain__AT__cs.wisc.edu, "Erik Espinoza" <espinoza_e__AT__ligo.caltech.edu>,
"Patrick Brady" <patrick__AT__gravity.phys.uwm.edu>
Could we have this in RHAS please?
Thanks,
Erik
On 9/21/06, R. Kent Wenger <wenger__AT__cs.wisc.edu> wrote:
> Stuart,
>
> Okay, I have a 6.8.2 pre-release condor_dagman for you, which should fix
> the problems with the "abort duplicate runs" feature.
>
> You can get it at:
>
> ftp://ftp.cs.wisc.edu/condor/temporary/forligo/condor_dagman
>
> Note that in order for this condor_dagman to properly abort itself, the
> earlier condor_dagman must also be the 6.8.2 pre-release version, not
> 6.8.1. In other words, to get the "no duplicates" protection on existing
> runs, you'll have to condor_rm them and start the rescue DAGs with the new
> condor_dagman binary.
>
> All of the configuration macros are the same as for 6.8.1.
>
> Please let me know if you have any questions.
>
> Kent Wenger
> Condor Team
===========================================================================
Date mail was appended: Tue Sep 26 13:40:14 2006 (1159296015)
Date: Tue, 26 Sep 2006 18:44:55 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>,
Miron Livny <miron__AT__cs.wisc.edu>, condor-support__AT__cs.wisc.edu,
tannenba__AT__cs.wisc.edu, zmiller__AT__cs.wisc.edu, adesmet__AT__cs.wisc.edu,
gthain__AT__cs.wisc.edu, Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>,
Patrick Brady <patrick__AT__gravity.phys.uwm.edu>,
"R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit status 4
On Tue, 26 Sep 2006, Erik A. Espinoza wrote:
> Could we have this in RHAS please?
Okay, I'm planning to get a couple more fixes in this week before I leave.
If it's okay with you, I'll wait until those are in place -- that would
probably mean some time Thursday.
Kent Wenger
Condor Team
===========================================================================
Date mail was appended: Tue Sep 26 18:47:30 2006 (1159314466)
Date: Thu, 28 Sep 2006 17:40:51 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>,
Miron Livny <miron__AT__cs.wisc.edu>, condor-support__AT__cs.wisc.edu,
tannenba__AT__cs.wisc.edu, zmiller__AT__cs.wisc.edu, adesmet__AT__cs.wisc.edu,
gthain__AT__cs.wisc.edu, Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>,
Patrick Brady <patrick__AT__gravity.phys.uwm.edu>,
"R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit status 4
On Tue, 26 Sep 2006, Erik A. Espinoza wrote:
> Could we have this in RHAS please?
Okay, I have 6.8.2 pre-release condor_dagman and condor_submit_dag
binaries ready for you. These have the change that the dagman.out file is
appended to instead of overwritten. I'm still working on the NFS log file
checking; I'll follow up on the appropriate ticket.
The binaries are available at:
ftp://ftp.cs.wisc.edu/condor/temporary/forligo/ia64_rhas_3/condor_dagman
ftp://ftp.cs.wisc.edu/condor/temporary/forligo/ia64_rhas_3/condor_submit_dag
ftp://ftp.cs.wisc.edu/condor/temporary/forligo/x86_rh_9/condor_dagman
ftp://ftp.cs.wisc.edu/condor/temporary/forligo/x86_rh_9/condor_submit_dag
ftp://ftp.cs.wisc.edu/condor/temporary/forligo/x86_rhas_3/condor_dagman
ftp://ftp.cs.wisc.edu/condor/temporary/forligo/x86_rhas_3/condor_submit_dag
Kent Wenger
Condor Team
===========================================================================
Date mail was appended: Thu Sep 28 17:44:32 2006 (1159483473)
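Editorial sketch: the six download locations in Kent's message follow one pattern (base directory, platform directory, binary name), so the pair of URLs for a given platform can be derived rather than copied by hand. The function name binary_urls is hypothetical; the base URL, platform directories, and binary names are taken from the message above. Nothing is downloaded here, and the temporary FTP area may no longer exist.

```python
from typing import List

# Values copied from the message above.
BASE_URL = "ftp://ftp.cs.wisc.edu/condor/temporary/forligo"
PLATFORMS = ("ia64_rhas_3", "x86_rh_9", "x86_rhas_3")
BINARIES = ("condor_dagman", "condor_submit_dag")


def binary_urls(platform: str) -> List[str]:
    """Return the URLs of both pre-release binaries for one platform directory."""
    if platform not in PLATFORMS:
        raise ValueError("unknown platform directory: " + platform)
    return [BASE_URL + "/" + platform + "/" + name for name in BINARIES]


if __name__ == "__main__":
    # Example: the RHAS 3 build on x86, the platform asked about in this thread.
    for url in binary_urls("x86_rhas_3"):
        print(url)
```

Both binaries should be taken from the same platform directory, since condor_submit_dag and condor_dagman were built and released together.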
Date: Fri, 20 Oct 2006 10:07:50 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza_e__AT__ligo.caltech.edu, patrick__AT__gravity.phys.uwm.edu,
espinoza__AT__ligo.caltech.edu, duncan__AT__gravity.phys.uwm.edu
Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit status 4
We have not had this problem as of condor-6.8.2, with its 3x reduction in
the number of schedd child processes created, so this ticket can be closed.
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Fri Oct 20 12:08:14 2006 (1161364094)
Subject: Actions
Ticket resolved by gthain
===========================================================================
Date of actions: Mon Oct 30 9:34:56 2006 (1162222496)