LIGO Support Ticket 7835

Ticket Information
  Number:      support 7835
  User:        dabrown@physics.syr.edu
  Email:       condorligo__AT__aei.mpg.de,miroslav.shaltev__AT__shaltev.de,atlas_admin__AT__aei.mpg.de,steffen.grunewald__AT__aei.mpg.de,anderson__AT__ligo.caltech.edu,carsten.aulbert__AT__aei.mpg.de,Thomas.Dent__AT__astro.cf.ac.uk,dja.mckechan__AT__gmail.com,ian.harry__AT__astro.cf.ac.uk,miroslav.shaltev__AT__aei.mpg.de
  Status:      resolved
  Assigned To: wenger
From: Duncan Brown <dabrown__AT__physics.syr.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
Subject: LIGO dagman restart bug
Date: Tue, 15 Sep 2009 11:48:21 -0400
X-Proofpoint-Virus-Version: vendor=fsecure engine=1.12.8161:2.4.5,1.2.40,4.0.166
 definitions=2009-09-15_03:2009-09-01,2009-09-15,2009-09-14 signatures=0
X-Proofpoint-Spam-Reason: safe
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu

Hi Kent,

Here is the email with details on the dagman bug we have found.

Cheers,
Duncan.

Begin forwarded message:

> From: Duncan Brown <dabrown__AT__physics.syr.edu>
> Date: September 14, 2009 8:16:47 PM EDT
> To: LIGO-Virgo Compact Coalescing Binaries Group <cbc__AT__gravity.phys.uwm.edu>
> Cc: condor-support response tracking system <condor-support__AT__cs.wisc.edu>,
>  Carsten Aulbert <carsten.aulbert__AT__aei.mpg.de>
> Subject: LIGO dagman restart bug. Was Re: [CBC] Low mass wk6  
> stalling on atlas
>
> Hi Ian (and Kent),
>
> Looks like Tom has found a genuine dagman bug. Kent, this is  
> critical for us, so I'll send you the log files separately.
>
> If I look at Tom's jobs on atlas1:
>
> [dbrown@atlas1 full_data]$ condor_q -direct schedd tdent
>
> -- Submitter: atlas1.atlas.aei.uni-hannover.de : <10.20.30.1:57045> : atlas1.atlas.aei.uni-hannover.de
> ID         OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
> 690822.0   tdent           9/9  16:20   4+23:58:48 R  0   7.3  condor_dagman
> 693783.0   tdent           9/9  19:23   2+18:38:02 I  0   7.3  condor_dagman
>
> 2 jobs; 1 idle, 1 running, 0 held
>
> I can see the ihope uber-dag (690822.0) still running. The second  
> dagman process (which won't have the same id when you check) is the  
> full data dagman. Condor is running the full data dag over and over  
> again (you can see this from the line)
>
> 9/15 02:08:05 Duplicate DAGMan PID 10734 is no longer alive; this DAGMan should continue.
>
> dagman is parsing the job log file until it hits a node that is
> restarted. Dagman then fails due to an internal bug.
>
> 9/15 02:09:12 Event: ULOG_EXECUTE for Condor Node f0f823110670b25b982eda696a902c72 (765250.0)
> 9/15 02:09:12 Number of idle job procs: 0
> 9/15 02:09:12 Event: ULOG_JOB_TERMINATED for Condor Node 6cc7c3f0914d5f352145006fcf860a4d (762838.0)
> 9/15 02:09:12 Node 6cc7c3f0914d5f352145006fcf860a4d job proc (762838.0) failed with signal 11.
> 9/15 02:09:12 Retrying node 6cc7c3f0914d5f352145006fcf860a4d (retry #1 of 1)...
> 9/15 02:09:12 Number of idle job procs: 0
> 9/15 02:09:12 ERROR "Searched for node for cluster 762838; got -1!!" at line 1151 in file dag.cpp
>
> The condor job log file looks fine to me, so I think this is a bug  
> in dagman reprocessing restarted jobs.
>
> I'm not sure why dagman died originally. It wasn't due to this error:
>
> 9/11 23:51:23 Of 4663 nodes total:
> 9/11 23:51:23  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
> 9/11 23:51:23   ===     ===      ===     ===     ===        ===      ===
> 9/11 23:51:23  3087       0      348       0       0       1228        0
> 9/11 23:53:44 ******************************************************
> 9/11 23:53:44 ** condor_scheduniv_exec.693783.0 (CONDOR_DAGMAN) STARTING UP
>
> Atlas seems to be periodically kicking this dag off and restarting it.
>
> Unfortunately, since dagman can't restart and reconstruct the dag in  
> memory it can't write a rescue dag. Try and remove the ihope dag and  
> resubmit the resulting rescue dag and see what happens.
>
> If the full data dag keeps cycling like it is, then IM me and
> I'll take a look.
>
> Cheers,
> Duncan.
>
> On Sep 14, 2009, at 5:35 PM, Ian Harry wrote:
>
>> Hi Duncan,
>>
>> Dave's dag is the only one we tried to fix. Tom's hasn't been  
>> altered at all.
>>
>> Ian
>>
>> 2009/9/14 Duncan Brown <dabrown__AT__physics.syr.edu>
>> Hi Ian,
>>
>> Are there any broken dags that you have not tried to fix? I'd like to
>> try and diagnose this problem, but to do that I need to see a DAG and
>> its log files that haven't been altered in any way. Have you changed
>> Tom's dag or is that in its natural state?
>>
>> Cheers,
>> Duncan.
>>
>> On Sep 14, 2009, at 4:22 PM, Ian Harry wrote:
>>
>> > Hi Carsten,
>> >
>> > My theory that the size of the dagman.out file was the problem was
>> > wrong. The jobs seem to be idling because they have been evicted too many
>> > times. On restarting them by hand they encounter (repeatedly) the
>> > error that Gianluca and others have reported.
>> >
>> > To get Dave's dag running again we removed the top ihope dag
>> > (however, the failing sub-dags did *not* produce rescue dags). Then
>> > we removed the .dagman.out files for the sub dags that were causing
>> > this error. Then we emptied the local directory where all the log
>> > files were stored. Then we were able to recover the dag by
>> > resubmitting, with all the failing dags starting from scratch.
>> >
>> > Does anyone have any suggestions for how to recover from this error
>> > without being forced to restart the dagmans from the beginning?
>> >
>> > Cheers
>> >
>> > Ian
>> >
>> > 2009/9/14 Carsten Aulbert <carsten.aulbert__AT__aei.mpg.de>
>> > On Monday 14 September 2009 18:59:46 Gianluca M Guidi wrote:
>> > > Hi,
>> > >
>> > > I have the same error:
>> > >
>> > > ERROR "Searched for node for cluster 754512; got -1!!" at line 1151 in
>> > > file dag.cpp
>> >
>> > It seems Ian fixed it by removing the huge log file. Ian, can you please
>> > elaborate on what exactly you did?
>> >
>> > Carsten
>> >
>> > _______________________________________________
>> > CBC web site
>> > http://www.lsc-group.phys.uwm.edu/ligovirgo/cbc
>> > CBC mailing list
>> > CBC__AT__gravity.phys.uwm.edu
>> > http://www.lsc-group.phys.uwm.edu/mailman/listinfo/cbc
>> >
>> >
>> >
>> > --
>> >  
>> ---------------------------------------------------------------------------
>> > Ian Harry
>> > School of Physics & Astronomy
>> > Queens Buildings, The Parade
>> > Cardiff, CF24 3AA
>> > Email: Ian.Harry__AT__astro.cf.ac.uk
>> > Phone: (+44) 29 208 75120
>> > Mobile: (+44) 7890 479090
>> >  
>> ---------------------------------------------------------------------------
>>
>> --
>>
>> Duncan Brown                          Room 263-1, Department of  
>> Physics,
>> Assistant Professor of Physics        Syracuse University, NY  
>> 13244, USA
>> Phone: (315) 443 5993             http://www.gravity.phy.syr.edu/~duncan
>>
>>
>>
>>
>>
>>
>> -- 
>> ---------------------------------------------------------------------------
>> Ian Harry
>> School of Physics & Astronomy
>> Queens Buildings, The Parade
>> Cardiff, CF24 3AA
>> Email: Ian.Harry__AT__astro.cf.ac.uk
>> Phone: (+44) 29 208 75120
>> Mobile: (+44) 7890 479090
>> ---------------------------------------------------------------------------
>
> -- 
>
> Duncan Brown                          Room 263-1, Department of  
> Physics,
> Assistant Professor of Physics        Syracuse University, NY 13244,  
> USA
> Phone: (315) 443 5993             http://www.gravity.phy.syr.edu/~duncan
>
>
>

-- 

Duncan Brown                          Room 263-1, Department of Physics,
Assistant Professor of Physics        Syracuse University, NY 13244, USA
Phone: (315) 443 5993             http://www.gravity.phy.syr.edu/~duncan
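
The two symptoms quoted in the forwarded message (DAGMan repeatedly starting up, and the fatal "Searched for node for cluster" error) can be counted with a small script. What follows is only a sketch: the function name summarize_dagman_out and the sample lines are illustrative, the patterns are taken from the excerpts in this ticket, and it assumes each event occupies a single line in the real dagman.out (the wrapping seen in the email comes from the mail client).

```python
import re

# Patterns derived from the dagman.out excerpts quoted in this ticket.
STARTUP = re.compile(r"\(CONDOR_DAGMAN\)\s+STARTING UP")
FATAL = re.compile(r'ERROR "Searched for node for cluster (\d+); got -1!!"')


def summarize_dagman_out(lines):
    """Count DAGMan start-ups and collect cluster ids named in the fatal error."""
    startups = 0
    failed_clusters = []
    for line in lines:
        if STARTUP.search(line):
            startups += 1
        match = FATAL.search(line)
        if match:
            failed_clusters.append(int(match.group(1)))
    return startups, failed_clusters


# Sample input: two lines copied from the ticket, joined back onto one line each.
SAMPLE = [
    "9/11 23:53:44 ** condor_scheduniv_exec.693783.0 (CONDOR_DAGMAN) STARTING UP",
    '9/15 02:09:12 ERROR "Searched for node for cluster 762838; got -1!!" '
    "at line 1151 in file dag.cpp",
]

startups, failed_clusters = summarize_dagman_out(SAMPLE)
print(startups, failed_clusters)
```

To run it against a real log, pass the open file object instead of SAMPLE; a start-up count well above one together with a non-empty cluster list would match the restart loop described above.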




===========================================================================
Date of creation: Tue Sep 15 10:48:32 2009 (1253029716)
Subject: Actions

Assigned to wenger by wenger
===========================================================================
Date of actions: Tue Sep 15 11:32:07 2009 (1253032327)
Date: Tue, 15 Sep 2009 12:08:48 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: wenger <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #7835] LIGO dagman restart bug

Duncan,

> Here is the email with details on the dagman bug we have found.

Okay, I finally got the RUST ticket (I guess your second try), and the 
files that you mentioned in the other email.  (Our network seems a bit 
flakey right now, so maybe that's why your original RUST attempt failed.)

Can you send me the original DAG files (at least the DAG file for the
one that's getting restarted and failing), and the node submit files
for that DAG?

Also, did you get any core files from DAGMan?  Sometimes it hits an
error that should cause it to exit, but at other times it just seems
to have disappeared without an error message, which could indicate a
core dump.

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Tue Sep 15 12:09:39 2009 (1253034579)
Date: Tue, 15 Sep 2009 12:24:00 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: wenger <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #7835] LIGO dagman restart bug

Duncan,

Hmm, there is a lot of stuff going on in that DAG...

One really interesting thing I just noticed is this:

9/10 21:10:35 submitting: condor_submit -a dag_node_name' '='
   'dba21f987c96551a01629916e66fe88f <truncated for clarity>
9/10 21:15:50 From submit: Submitting job(s)
9/10 21:15:50 From submit: ERROR: Failed to set ClusterId=753023 for job
   753023.0 (110)
9/10 21:15:50 From submit:
9/10 21:15:50 From submit: ERROR: Failed to queue job.
9/10 21:15:50 failed while reading from pipe.
9/10 21:15:50 Read so far: Submitting job(s)ERROR: Failed to set
   ClusterId=753023 for job 753023.0 (110)ERROR: Failed to queue job.
9/10 21:15:50 ERROR: submit attempt failed

So condor_submit took over five minutes to run, and ended up failing! 
DAGMan seemed to handle this error correctly, but it might point to some
deeper problem with the submit machine.  I haven't found out yet exactly
what might cause the 'Failed to set ClusterId' error.

Kent Wenger
Condor Team
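
The five-minute gap noted above can be found mechanically by pairing each "submitting:" line with the first "From submit:" line that follows it. This is a rough sketch, not a Condor tool: the names parse_stamp and slow_submits, the 60-second threshold, and the year argument are assumptions for illustration (dagman.out timestamps carry no year, and a log that crosses a year boundary is not handled). The condor_submit command line in the sample is abbreviated.

```python
import re
from datetime import datetime

# "M/D HH:MM:SS rest-of-line", as in the dagman.out excerpt above.
STAMP = re.compile(r"^(\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2}) (.*)$")


def parse_stamp(text, year=2009):
    # The log omits the year, so one is supplied (an assumption) for parsing.
    return datetime.strptime(f"{year}/{text}", "%Y/%m/%d %H:%M:%S")


def slow_submits(lines, threshold_seconds=60):
    """Return (start_time, seconds) for submits whose first reply was slow."""
    pending = None
    slow = []
    for line in lines:
        match = STAMP.match(line)
        if not match:
            continue  # continuation lines have no timestamp
        when, rest = parse_stamp(match.group(1)), match.group(2)
        if rest.startswith("submitting:"):
            pending = when
        elif rest.startswith("From submit:") and pending is not None:
            elapsed = (when - pending).total_seconds()
            if elapsed >= threshold_seconds:
                slow.append((pending, elapsed))
            pending = None
    return slow


SAMPLE = [
    "9/10 21:10:35 submitting: condor_submit -a dag_node_name ...",
    "9/10 21:15:50 From submit: Submitting job(s)",
    "9/10 21:15:50 From submit: ERROR: Failed to queue job.",
]

for start, seconds in slow_submits(SAMPLE):
    print(start, seconds)
```

For the excerpt above this reports one slow submit, started at 21:10:35 and answered 315 seconds later.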

===========================================================================
Date mail was appended: Tue Sep 15 12:24:42 2009 (1253035483)
Date: Tue, 15 Sep 2009 14:48:45 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: wenger <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #7835] LIGO dagman restart bug

Duncan,

You don't have to bother sending more info -- I've managed to reproduce 
the problem with a simplified test case.

Kent Wenger
Condor Team


===========================================================================
Date mail was appended: Tue Sep 15 14:48:55 2009 (1253044136)
From: Duncan Brown <dabrown__AT__physics.syr.edu>
To: condor-support__AT__cs.wisc.edu
Subject: Re: [condor-support #7835] LIGO dagman restart bug
Date: Tue, 15 Sep 2009 16:41:59 -0400
X-Proofpoint-Virus-Version: vendor=fsecure engine=1.12.8161:2.4.5,1.2.40,4.0.166
 definitions=2009-09-15_09:2009-09-01,2009-09-15,2009-09-15 signatures=0
X-Proofpoint-Spam-Reason: safe
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu

Hi Kent,

Great, thanks. I think dagman is hitting a bug caused by a sick schedd  
machine; I believe there is another problem that needs to be tracked  
down, but dagman should be robust against it.

Cheers,
Duncan.

On Sep 15, 2009, at 3:48 PM, condor-support response tracking system  
wrote:

> Duncan,
>
> You don't have to bother sending more info -- I've managed to  
> reproduce
> the problem with a simplified test case.
>
> Kent Wenger
> Condor Team
>
>
>
> ========================================
> MESSAGE INFORMATION
> ========================================
> * From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
> * Ticket Email List: dabrown__AT__physics.syr.edu,
>

-- 

Duncan Brown                          Room 263-1, Department of Physics,
Assistant Professor of Physics        Syracuse University, NY 13244, USA
Phone: (315) 443 5993             http://www.gravity.phy.syr.edu/~duncan




===========================================================================
Date mail was appended: Tue Sep 15 15:42:12 2009 (1253047332)
Date: Tue, 15 Sep 2009 15:48:44 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #7835] LIGO dagman restart bug

Duncan,

> Great, thanks. I think dagman is hitting a bug caused by a sick schedd
> machine; I believe there is another problem that needs to be tracked
> down, but dagman should be robust against it.

Yes, a sick schedd machine might explain the condor_submit problem.  But 
there is definitely a DAGMan bug separate from that.  I've tracked it
down to the fact that the relevant node job failed by getting killed with 
a signal, versus returning a non-zero exit code.  Those are slightly 
different code paths inside DAGMan, so obviously something is wrong with 
the "abnormal termination" path.  (If the job fails with a non-zero exit 
code, everything works -- I modified my test to verify that.)

Kent Wenger
Condor Team
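
The distinction drawn above, between a job killed by a signal and a job that returns a non-zero exit code, can be illustrated outside DAGMan. The sketch below is not DAGMan's code; it only shows how a parent process sees the two cases through Python's subprocess module, which on POSIX reports termination by signal N as a return code of -N. The names classify_returncode and run_and_classify are made up for this example.

```python
import subprocess
import sys


def classify_returncode(returncode):
    """Map a subprocess return code onto the two failure paths discussed above."""
    if returncode == 0:
        return ("success", 0)
    if returncode < 0:
        # POSIX only: -N means the child was terminated by signal N.
        return ("signal", -returncode)
    return ("exit_code", returncode)


def run_and_classify(argv):
    """Run a command and report how it ended."""
    return classify_returncode(subprocess.run(argv).returncode)


# A child that fails normally, by returning a non-zero exit code.
print(run_and_classify([sys.executable, "-c", "import sys; sys.exit(3)"]))
```

A node job that "failed with signal 11", as in the log excerpt earlier in this ticket, would come back from subprocess as -11 and be classified as ("signal", 11), whereas an ordinary failure lands in the "exit_code" branch.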

===========================================================================
Date mail was appended: Tue Sep 15 15:49:00 2009 (1253047740)
From: Duncan Brown <dabrown__AT__physics.syr.edu>
To: condor-support__AT__cs.wisc.edu
Subject: Re: [condor-support #7835] LIGO dagman restart bug
Date: Wed, 16 Sep 2009 11:05:57 -0400
X-Proofpoint-Virus-Version: vendor=fsecure engine=1.12.8161:2.4.5,1.2.40,4.0.166
 definitions=2009-09-16_05:2009-09-01,2009-09-16,2009-09-16 signatures=0
X-Proofpoint-Spam-Reason: safe
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu

Hi Kent,

Thanks. This is a high priority bug for us, so let us know when you  
want us to test code.

Cheers,
Duncan.

On Sep 15, 2009, at 4:49 PM, condor-support response tracking system  
wrote:

> Duncan,
>
>> Great, thanks. I think dagman is hitting a bug caused by a sick  
>> schedd
>> machine; I believe there is another problem that needs to be tracked
>> down, but dagman should be robust against it.
>
> Yes, a sick schedd machine might explain the condor_submit problem.   
> But
> there is definitely a DAGMan bug separate from that.  I've tracked it
> down to the fact that the relevant node job failed by getting killed  
> with
> a signal, versus returning a non-zero exit code.  Those are slightly
> different code paths inside DAGMan, so obviously something is wrong  
> with
> the "abnormal termination" path.  (If the job fails with a non-zero  
> exit
> code, everything works -- I modified my test to verify that.)
>
> Kent Wenger
> Condor Team
>
>
> ========================================
> MESSAGE INFORMATION
> ========================================
> * From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
> * Ticket Email List: dabrown__AT__physics.syr.edu,
>

-- 

Duncan Brown                          Room 263-1, Department of Physics,
Assistant Professor of Physics        Syracuse University, NY 13244, USA
Phone: (315) 443 5993             http://www.gravity.phy.syr.edu/~duncan




===========================================================================
Date mail was appended: Wed Sep 16 10:06:09 2009 (1253113569)
Date: Wed, 16 Sep 2009 15:38:21 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: wenger <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #7835] LIGO dagman restart bug

Duncan,

>> Looks like Tom has found a genuine dagman bug. Kent, this is
>> critical for us, so I'll send you the log files separately.

I have a fix for this that I'm just running through all of the tests right 
now.  Unless some unexpected problem crops up, I should be able to give 
you binaries tomorrow (these would be 7.4.0 pre-release).

Which platforms do you need?

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Wed Sep 16 15:38:30 2009 (1253133511)
From: Duncan Brown <dabrown__AT__physics.syr.edu>
To: condor-support__AT__cs.wisc.edu
Subject: Re: [condor-support #7835] LIGO dagman restart bug
Date: Wed, 16 Sep 2009 16:41:20 -0400
X-Proofpoint-Virus-Version: vendor=fsecure engine=1.12.8161:2.4.5,1.2.40,4.0.166
 definitions=2009-09-16_08:2009-09-01,2009-09-16,2009-09-16 signatures=0
X-Proofpoint-Spam-Reason: safe
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu

Hi Kent,

Debian Lenny and CentOS 5.3, both x86_64.

Cheers,
Duncan.

On Sep 16, 2009, at 4:38 PM, condor-support response tracking system  
wrote:

> Duncan,
>
>>> Looks like Tom has found a genuine dagman bug. Kent, this is
>>> critical for us, so I'll send you the log files separately.
>
> I have a fix for this that I'm just running through all of the tests  
> right
> now.  Unless some unexpected problem crops up, I should be able to  
> give
> you binaries tomorrow (these would be 7.4.0 pre-release).
>
> Which platforms do you need?
>
> Kent Wenger
> Condor Team
>
>
> ========================================
> MESSAGE INFORMATION
> ========================================
> * From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
> * Ticket Email List: dabrown__AT__physics.syr.edu,
>

-- 

Duncan Brown                          Room 263-1, Department of Physics,
Assistant Professor of Physics        Syracuse University, NY 13244, USA
Phone: (315) 443 5993             http://www.gravity.phy.syr.edu/~duncan




===========================================================================
Date mail was appended: Wed Sep 16 15:41:31 2009 (1253133692)
From: Duncan Brown <dabrown__AT__physics.syr.edu>
To: condor-support__AT__cs.wisc.edu
Subject: Re: [condor-support #7835] LIGO dagman restart bug
Date: Wed, 16 Sep 2009 17:06:35 -0400
X-Proofpoint-Virus-Version: vendor=fsecure engine=1.12.8161:2.4.5,1.2.40,4.0.166
 definitions=2009-09-16_08:2009-09-01,2009-09-16,2009-09-16 signatures=0
X-Proofpoint-Spam-Reason: safe
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu

Hi Kent,

Actually, while both would be nice, the highest priority is lenny:

[dbrown@atlas3 ~]$ uname -a
Linux atlas3.atlas.aei.uni-hannover.de 2.6.27.30-atlas-generic #1 SMP  
Mon Aug 17 11:14:08 CEST 2009 x86_64 GNU/Linux
[dbrown@atlas3 ~]$ cat /etc/issue
Debian GNU/Linux 5.0 \n \l

Thanks for looking at this for us.

Cheers,
Duncan.

On Sep 16, 2009, at 4:38 PM, condor-support response tracking system  
wrote:

> Duncan,
>
>>> Looks like Tom has found a genuine dagman bug. Kent, this is
>>> critical for us, so I'll send you the log files separately.
>
> I have a fix for this that I'm just running through all of the tests  
> right
> now.  Unless some unexpected problem crops up, I should be able to  
> give
> you binaries tomorrow (these would be 7.4.0 pre-release).
>
> Which platforms do you need?
>
> Kent Wenger
> Condor Team
>
>
> ========================================
> MESSAGE INFORMATION
> ========================================
> * From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
> * Ticket Email List: dabrown__AT__physics.syr.edu,
>

-- 

Duncan Brown                          Room 263-1, Department of Physics,
Assistant Professor of Physics        Syracuse University, NY 13244, USA
Phone: (315) 443 5993             http://www.gravity.phy.syr.edu/~duncan




===========================================================================
Date mail was appended: Wed Sep 16 16:06:47 2009 (1253135207)
Date: Thu, 17 Sep 2009 10:29:30 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #7835] LIGO dagman restart bug

Duncan,

> Debian Lenny and CentOS 5.3, both x86_64.

You should find what you need here:

ftp://ftp.cs.wisc.edu/condor/temporary/forligo/dagman-7.4.0-prerelease-2009-09-17/

(I think the x86_64_rhap_5 build is what you want for Centos 5.3 -- let me 
know if that doesn't work.)

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Thu Sep 17 10:30:23 2009 (1253201424)
CC: Condor/LIGO mailing list <condorligo__AT__aei.mpg.de>,
 "Miroslav V.Shaltev" <miroslav.shaltev__AT__shaltev.de>
From: Duncan Brown <dabrown__AT__physics.syr.edu>
To: condor-support__AT__cs.wisc.edu
Subject: Re: [condor-support #7835] LIGO dagman restart bug
Date: Thu, 17 Sep 2009 11:37:44 -0400
X-Proofpoint-Virus-Version: vendor=fsecure engine=1.12.8161:2.4.5,1.2.40,4.0.166
 definitions=2009-09-17_07:2009-09-17,2009-09-17,2009-09-17 signatures=0
X-Proofpoint-Spam-Reason: safe
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu

Hi Kent,

Thanks. Carsten, please can you install the new condor_dagman binary  
on atlas?

Cheers,
Duncan.

On Sep 17, 2009, at 11:30 AM, condor-support response tracking system  
wrote:

> Duncan,
>
>> Debian Lenny and CentOS 5.3, both x86_64.
>
> You should find what you need here:
>
> ftp://ftp.cs.wisc.edu/condor/temporary/forligo/dagman-7.4.0-prerelease-2009-09-17/
>
> (I think the x86_64_rhap_5 build is what you want for Centos 5.3 --  
> let me
> know if that doesn't work.)
>
> Kent Wenger
> Condor Team
>
>
> ========================================
> MESSAGE INFORMATION
> ========================================
> * From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
> * Ticket Email List: dabrown__AT__physics.syr.edu,
>

-- 

Duncan Brown                          Room 263-1, Department of Physics,
Assistant Professor of Physics        Syracuse University, NY 13244, USA
Phone: (315) 443 5993             http://www.gravity.phy.syr.edu/~duncan




===========================================================================
Date mail was appended: Thu Sep 17 10:38:02 2009 (1253201883)
CC: condor-support response tracking system <condor-support__AT__cs.wisc.edu>,
 Condor/LIGO mailing list <condorligo__AT__aei.mpg.de>,
 "Miroslav V.Shaltev" <miroslav.shaltev__AT__shaltev.de>
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: Duncan Brown <dabrown__AT__physics.syr.edu>, atlas_admin__AT__aei.mpg.de,
 Steffen Grunewald <steffen.grunewald__AT__aei.mpg.de>
Subject: Re: [CondorLIGO] Re: [condor-support #7835] LIGO dagman restart bug
Date: Thu, 17 Sep 2009 09:24:42 -0700
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu

I am not sure if Carsten has started his vacation time yet so I have  
added the Atlas admin list to this ticket.

More generally is there a LIGO Debian admin that will be able to fill  
in as the Debian point of contact for Condor/VDT and monitor the  
Condor/LIGO mailing list while Carsten is taking some time off?

Thanks.

On Sep 17, 2009, at 8:37 AM, Duncan Brown wrote:

> Hi Kent,
>
> Thanks. Carsten, please can you install the new condor_dagman binary  
> on atlas?
>
> Cheers,
> Duncan.
>
> On Sep 17, 2009, at 11:30 AM, condor-support response tracking  
> system wrote:
>
>> Duncan,
>>
>>> Debian Lenny and CentOS 5.3, both x86_64.
>>
>> You should find what you need here:
>>
>> ftp://ftp.cs.wisc.edu/condor/temporary/forligo/dagman-7.4.0-prerelease-2009-09-17/
>>
>> (I think the x86_64_rhap_5 build is what you want for Centos 5.3 --  
>> let me
>> know if that doesn't work.)
>>
>> Kent Wenger
>> Condor Team
>>
>>
>> ========================================
>> MESSAGE INFORMATION
>> ========================================
>> * From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
>> * Ticket Email List: dabrown__AT__physics.syr.edu,
>>
>
> -- 
>
> Duncan Brown                          Room 263-1, Department of  
> Physics,
> Assistant Professor of Physics        Syracuse University, NY 13244,  
> USA
> Phone: (315) 443 5993             http://www.gravity.phy.syr.edu/~duncan
>
>
>
> _______________________________________________
> Condorligo mailing list
> Condorligo__AT__aei.mpg.de
> http://lists.aei.mpg.de/cgi-bin/mailman/listinfo/condorligo


===========================================================================
Date mail was appended: Thu Sep 17 11:24:58 2009 (1253204699)
From: Carsten Aulbert <carsten.aulbert__AT__aei.mpg.de>
To: atlas_admin__AT__aei.mpg.de
Subject: Re: [Atlas_admin] Re: [CondorLIGO] Re: [condor-support #7835]
 LIGO dagman restart bug
Date: Thu, 17 Sep 2009 19:34:44 +0200
CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>,        Duncan Brown
 <dabrown__AT__physics.syr.edu>,        Steffen Grunewald
 <steffen.grunewald__AT__aei.mpg.de>,
 "condor-support response tracking system" <condor-support__AT__cs.wisc.edu>,
 "Condor/LIGO mailing list" <condorligo__AT__aei.mpg.de>
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu

On Thursday 17 September 2009 18:24:42 Stuart Anderson wrote:
> I am not sure if Carsten has started his vacation time yet so I have
> added the Atlas admin list to this ticket.
> 
> More generally is there a LIGO Debian admin that will be able to fill
> in as the Debian point of contact for Condor/VDT and monitor the
> Condor/LIGO mailing list while Carsten is taking some time off?

I think Henning is :)

I won't be able to insert the binaries tonight as I'm going to leave in about 
10 minutes.

Carsten

===========================================================================
Date mail was appended: Thu Sep 17 12:34:59 2009 (1253208899)
Subject: Actions

Status changed from open to pending by wenger
===========================================================================
Date of actions: Thu Sep 17 12:39:13 2009 (1253209153)
CC: atlas_admin__AT__aei.mpg.de,        condor-support response tracking system
 <condor-support__AT__cs.wisc.edu>,        Condor/LIGO mailing list
 <condorligo__AT__aei.mpg.de>
From: Duncan Brown <dabrown__AT__physics.syr.edu>
To: Carsten Aulbert <carsten.aulbert__AT__aei.mpg.de>
Subject: Re: [Atlas_admin] Re: [CondorLIGO] Re: [condor-support #7835]
 LIGO dagman restart bug
Date: Sat, 19 Sep 2009 16:59:26 +0200
X-Proofpoint-Virus-Version: vendor=fsecure engine=1.12.8161:2.4.5,1.2.40,4.0.166
 definitions=2009-09-18_11:2009-09-17,2009-09-18,2009-09-18 signatures=0
X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 spamscore=0
 ipscore=0 phishscore=0 bulkscore=0 adultscore=0 classifier=spam adjust=0
 reason=mlx engine=5.0.0-0908210000 definitions=main-0909190035
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu

Hi Carsten and Henning,

Any news on getting this installed? This is critical for the cbc group.

Cheers,
Duncan.

On Sep 17, 2009, at 7:34 PM, Carsten Aulbert wrote:

> On Thursday 17 September 2009 18:24:42 Stuart Anderson wrote:
>> I am not sure if Carsten has started his vacation time yet so I have
>> added the Atlas admin list to this ticket.
>>
>> More generally is there a LIGO Debian admin that will be able to fill
>> in as the Debian point of contact for Condor/VDT and monitor the
>> Condor/LIGO mailing list while Carsten is taking some time off?
>
> I think Henning is :)
>
> I won't be able to insert the binaries tonight as I'm going to leave  
> in about
> 10 minutes.
>
> Carsten
> _______________________________________________
> Condorligo mailing list
> Condorligo__AT__aei.mpg.de
> http://lists.aei.mpg.de/cgi-bin/mailman/listinfo/condorligo

-- 

Duncan Brown                          Room 263-1, Department of Physics,
Assistant Professor of Physics        Syracuse University, NY 13244, USA
Phone: (315) 443 5993             http://www.gravity.phy.syr.edu/~duncan




===========================================================================
Date mail was appended: Sat Sep 19  9:59:47 2009 (1253372387)
From: Carsten Aulbert <carsten.aulbert__AT__aei.mpg.de>
To: Duncan Brown <dabrown__AT__physics.syr.edu>
Subject: Re: [Atlas_admin] Re: [CondorLIGO] Re: [condor-support #7835]
 LIGO dagman restart bug
Date: Sat, 19 Sep 2009 17:12:04 +0200
CC: atlas_admin__AT__aei.mpg.de,
 "condor-support response tracking system" <condor-support__AT__cs.wisc.edu>,
 "Condor/LIGO mailing list" <condorligo__AT__aei.mpg.de>
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu

Hi Duncan,

On Saturday 19 September 2009 16:59:26 Duncan Brown wrote:
> Hi Carsten and Henning,
> 
> Any news on getting this installed? This is critical for the cbc group.

I've replaced it on atlas3 right now as this is our "test" head node for new
software - however, I'm not sure on which other node it should be installed
for testing as people might have set up their jobs there.

Carsten


===========================================================================
Date mail was appended: Sat Sep 19 10:12:15 2009 (1253373136)
CC: condor-support response tracking system <condor-support__AT__cs.wisc.edu>,
 Condor/LIGO mailing list <condorligo__AT__aei.mpg.de>, atlas_admin__AT__aei.mpg.de,
 Thomas Dent <Thomas.Dent__AT__astro.cf.ac.uk>, David McKechan
 <dja.mckechan__AT__gmail.com>
From: Duncan Brown <dabrown__AT__physics.syr.edu>
To: Carsten Aulbert <carsten.aulbert__AT__aei.mpg.de>
Subject: Re: [Atlas_admin] Re: [CondorLIGO] Re: [condor-support #7835]
 LIGO dagman restart bug
Date: Sat, 19 Sep 2009 17:22:49 +0200
X-Proofpoint-Virus-Version: vendor=fsecure engine=1.12.8161:2.4.5,1.2.40,4.0.166
 definitions=2009-09-18_11:2009-09-17,2009-09-18,2009-09-18 signatures=0
X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 spamscore=0
 ipscore=0 phishscore=0 bulkscore=0 adultscore=0 classifier=spam adjust=0
 reason=mlx engine=5.0.0-0908210000 definitions=main-0909190040
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu

Hi Carsten,

Thanks. Tom, Dave which head node are your dags on? If they are on  
atlas3, can you restart them?

Cheers,
Duncan.

On Sep 19, 2009, at 5:12 PM, Carsten Aulbert wrote:

> Hi Duncan,
>
> On Saturday 19 September 2009 16:59:26 Duncan Brown wrote:
>> Hi Carsten and Henning,
>>
>> Any news on getting this installed? This is critical for the cbc  
>> group.
>
> I've replaced it on atlas3 right now as this is our "test" head  
> nodes for new
> software - however, I'm not sure on which node it should be  
> installed as well
> for testing as people might have set up their job there.
>
> Carsten
>
> _______________________________________________
> Condorligo mailing list
> Condorligo__AT__aei.mpg.de
> http://lists.aei.mpg.de/cgi-bin/mailman/listinfo/condorligo

-- 

Duncan Brown                          Room 263-1, Department of Physics,
Assistant Professor of Physics        Syracuse University, NY 13244, USA
Phone: (315) 443 5993             http://www.gravity.phy.syr.edu/~duncan




===========================================================================
Date mail was appended: Sat Sep 19 10:23:18 2009 (1253373799)
Date: Sat, 19 Sep 2009 17:26:56 +0200
Subject: Re: [Atlas_admin] Re: [CondorLIGO] Re: [condor-support #7835]
 LIGO dagman restart bug
From: David McKechan <dja.mckechan__AT__gmail.com>
To: Duncan Brown <dabrown__AT__physics.syr.edu>
CC: Carsten Aulbert <carsten.aulbert__AT__aei.mpg.de>,        condor-support
 response tracking system <condor-support__AT__cs.wisc.edu>,
 "Condor/LIGO mailing list" <condorligo__AT__aei.mpg.de>,
 atlas_admin__AT__aei.mpg.de, Thomas Dent <Thomas.Dent__AT__astro.cf.ac.uk>
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu

On Sat, Sep 19, 2009 at 5:22 PM, Duncan Brown <dabrown__AT__physics.syr.edu> wrote:
> Hi Carsten,
>
> Thanks. Tom, Dave which head node are your dags on? If they are on atlas3,
> can you restart them?

I'm on atlas1 and I think Tom is too.

Cheers,
Dave


> Cheers,
> Duncan.
>
> On Sep 19, 2009, at 5:12 PM, Carsten Aulbert wrote:
>
>> Hi Duncan,
>>
>> On Saturday 19 September 2009 16:59:26 Duncan Brown wrote:
>>>
>>> Hi Carsten and Henning,
>>>
>>> Any news on getting this installed? This is critical for the cbc group.
>>
>> I've replaced it on atlas3 right now as this is our "test" head nodes for
>> new
>> software - however, I'm not sure on which node it should be installed as
>> well
>> for testing as people might have set up their job there.
>>
>> Carsten
>>
>> _______________________________________________
>> Condorligo mailing list
>> Condorligo__AT__aei.mpg.de
>> http://lists.aei.mpg.de/cgi-bin/mailman/listinfo/condorligo
>
> --
>
> Duncan Brown                          Room 263-1, Department of Physics,
> Assistant Professor of Physics        Syracuse University, NY 13244, USA
> Phone: (315) 443 5993             http://www.gravity.phy.syr.edu/~duncan
>
>
>
>



-- 
Help me raise money for Barnardo's http://www.waitup.org.uk

===========================================================================
Date mail was appended: Sat Sep 19 10:27:09 2009 (1253374029)
From: Carsten Aulbert <carsten.aulbert__AT__aei.mpg.de>
To: David McKechan <dja.mckechan__AT__gmail.com>
Subject: Re: [Atlas_admin] Re: [CondorLIGO] Re: [condor-support #7835]
 LIGO dagman restart bug
Date: Sat, 19 Sep 2009 17:31:39 +0200
CC: Duncan Brown <dabrown__AT__physics.syr.edu>,
 "condor-support response tracking system" <condor-support__AT__cs.wisc.edu>,
 "Condor/LIGO mailing list" <condorligo__AT__aei.mpg.de>,
 atlas_admin__AT__aei.mpg.de, Thomas Dent <Thomas.Dent__AT__astro.cf.ac.uk>
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu

On Saturday 19 September 2009 17:26:56 David McKechan wrote:
> On Sat, Sep 19, 2009 at 5:22 PM, Duncan Brown <dabrown__AT__physics.syr.edu> wrote:
> > Hi Carsten,
> >
> > Thanks. Tom, Dave which head node are your dags on? If they are on
> > atlas3, can you restart them?
> 
> I'm on atlas1 and I think Tom is too.

OK, please try again at X:40 (where X is the current hour in your time zone),
i.e. give me about 9 minutes.

Carsten

===========================================================================
Date mail was appended: Sat Sep 19 10:31:50 2009 (1253374311)
Date: Sat, 19 Sep 2009 16:34:15 +0100 (BST)
From: Thomas Dent <Thomas.Dent__AT__astro.cf.ac.uk>
To: Duncan Brown <dabrown__AT__physics.syr.edu>
CC: Carsten Aulbert <carsten.aulbert__AT__aei.mpg.de>,        condor-support
 response tracking system <condor-support__AT__cs.wisc.edu>,        Condor/LIGO
 mailing list <condorligo__AT__aei.mpg.de>,        atlas_admin__AT__aei.mpg.de,
 Thomas Dent <Thomas.Dent__AT__astro.cf.ac.uk>,        David McKechan
 <dja.mckechan__AT__gmail.com>
Subject: Re: [Atlas_admin] Re: [CondorLIGO] Re: [condor-support #7835]
 LIGO dagman restart bug
X-Cu-PHYSX-Virus-Scan: ClamAV did not find anything.
X-Seen-BY: mailfromd 4.1 granite.cs.wisc.edu


... atlas1.

On Sat, 19 Sep 2009, Duncan Brown wrote:

> Hi Carsten,
>
> Thanks. Tom, Dave which head node are your dags on? If they are on atlas3, 
> can you restart them?
>
> Cheers,
> Duncan.
>
> On Sep 19, 2009, at 5:12 PM, Carsten Aulbert wrote:
>
>> Hi Duncan,
>> 
>> On Saturday 19 September 2009 16:59:26 Duncan Brown wrote:
>>> Hi Carsten and Henning,
>>> 
>>> Any news on getting this installed? This is critical for the cbc group.
>> 
>> I've replaced it on atlas3 right now as this is our "test" head nodes for 
>> new
>> software - however, I'm not sure on which node it should be installed as 
>> well
>> for testing as people might have set up their job there.
>> 
>> Carsten
>> 
>> _______________________________________________
>> Condorligo mailing list
>> Condorligo__AT__aei.mpg.de
>> http://lists.aei.mpg.de/cgi-bin/mailman/listinfo/condorligo
>
>

-- 
-------------------
Physics & Astronomy
Cardiff University

===========================================================================
Date mail was appended: Sat Sep 19 10:34:36 2009 (1253374476)
From: Carsten Aulbert <carsten.aulbert__AT__aei.mpg.de>
To: David McKechan <dja.mckechan__AT__gmail.com>
Subject: Re: [Atlas_admin] Re: [CondorLIGO] Re: [condor-support #7835]
 LIGO dagman restart bug
Date: Sat, 19 Sep 2009 17:51:09 +0200
CC: Duncan Brown <dabrown__AT__physics.syr.edu>,
 "condor-support response tracking system" <condor-support__AT__cs.wisc.edu>,
 "Condor/LIGO mailing list" <condorligo__AT__aei.mpg.de>,
 atlas_admin__AT__aei.mpg.de, Thomas Dent <Thomas.Dent__AT__astro.cf.ac.uk>
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu

On Saturday 19 September 2009 17:26:56 David McKechan wrote:
> On Sat, Sep 19, 2009 at 5:22 PM, Duncan Brown <dabrown__AT__physics.syr.edu> wrote:
> > Hi Carsten,
> >
> > Thanks. Tom, Dave which head node are your dags on? If they are on
> > atlas3, can you restart them?
> 
> I'm on atlas1 and I think Tom is too.

OK, atlas1 and atlas3 are now using the new binaries. Please let us know if
this fixes the problem.

Carsten

===========================================================================
Date mail was appended: Sat Sep 19 10:51:26 2009 (1253375487)
CC: David McKechan <dja.mckechan__AT__gmail.com>,        Thomas Dent
 <Thomas.Dent__AT__astro.cf.ac.uk>,        condor-support response tracking
 system <condor-support__AT__cs.wisc.edu>,        Condor/LIGO mailing list
 <condorligo__AT__aei.mpg.de>,        atlas_admin__AT__aei.mpg.de
From: Duncan Brown <dabrown__AT__physics.syr.edu>
To: Carsten Aulbert <carsten.aulbert__AT__aei.mpg.de>
Subject: Re: [Atlas_admin] Re: [CondorLIGO] Re: [condor-support #7835]
 LIGO dagman restart bug
Date: Sat, 19 Sep 2009 17:54:09 +0200
X-Proofpoint-Virus-Version: vendor=fsecure engine=1.12.8161:2.4.5,1.2.40,4.0.166
 definitions=2009-09-18_11:2009-09-17,2009-09-18,2009-09-18 signatures=0
X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 spamscore=0
 ipscore=0 phishscore=0 bulkscore=0 adultscore=0 classifier=spam adjust=0
 reason=mlx engine=5.0.0-0908210000 definitions=main-0909190044
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu

Hi Carsten,

Thanks. Tom, Dave, I think I know a way to save the jobs you have  
already run. Can you come and find me before you remove and resubmit  
your dags?

Cheers,
Duncan.

On Sep 19, 2009, at 5:51 PM, Carsten Aulbert wrote:

> On Saturday 19 September 2009 17:26:56 David McKechan wrote:
>> On Sat, Sep 19, 2009 at 5:22 PM, Duncan Brown <dabrown__AT__physics.syr.edu> wrote:
>>> Hi Carsten,
>>>
>>> Thanks. Tom, Dave which head node are your dags on? If they are on
>>> atlas3, can you restart them?
>>
>> I'm on atlas1 and I think Tom is too.
>
> OK, atlas1 and atlas3 are now using the new binaries, please let us  
> know, if
> this fixes the problem.
>
> Carsten
> _______________________________________________
> Condorligo mailing list
> Condorligo__AT__aei.mpg.de
> http://lists.aei.mpg.de/cgi-bin/mailman/listinfo/condorligo

-- 

Duncan Brown                          Room 263-1, Department of Physics,
Assistant Professor of Physics        Syracuse University, NY 13244, USA
Phone: (315) 443 5993             http://www.gravity.phy.syr.edu/~duncan




===========================================================================
Date mail was appended: Sat Sep 19 10:54:37 2009 (1253375678)
CC: David McKechan <dja.mckechan__AT__gmail.com>,        Thomas Dent
 <Thomas.Dent__AT__astro.cf.ac.uk>,        condor-support response tracking
 system <condor-support__AT__cs.wisc.edu>,        Condor/LIGO mailing list
 <condorligo__AT__aei.mpg.de>,        atlas_admin__AT__aei.mpg.de
From: Duncan Brown <dabrown__AT__physics.syr.edu>
To: Carsten Aulbert <carsten.aulbert__AT__aei.mpg.de>
Subject: Re: [Atlas_admin] Re: [CondorLIGO] Re: [condor-support #7835]
 LIGO dagman restart bug
Date: Sat, 19 Sep 2009 18:38:50 +0200
X-Proofpoint-Virus-Version: vendor=fsecure engine=1.12.8161:2.4.5,1.2.40,4.0.166
 definitions=2009-09-18_11:2009-09-17,2009-09-18,2009-09-18 signatures=0
X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 spamscore=0
 ipscore=0 phishscore=0 bulkscore=0 adultscore=0 classifier=spam adjust=0
 reason=mlx engine=5.0.0-0908210000 definitions=main-0909190049
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu

Hi Carsten, Kent,

Thanks, it looks like it's working.

Carsten, please can you do the atlas-specific magic that needs to be
done to get Tom's jobs out of the idle state and into the running state?

Dave, I'll show you how to do this as well.

Cheers,
Duncan.

On Sep 19, 2009, at 5:51 PM, Carsten Aulbert wrote:

> On Saturday 19 September 2009 17:26:56 David McKechan wrote:
>> On Sat, Sep 19, 2009 at 5:22 PM, Duncan Brown <dabrown__AT__physics.syr.edu> wrote:
>>> Hi Carsten,
>>>
>>> Thanks. Tom, Dave which head node are your dags on? If they are on
>>> atlas3, can you restart them?
>>
>> I'm on atlas1 and I think Tom is too.
>
> OK, atlas1 and atlas3 are now using the new binaries, please let us  
> know, if
> this fixes the problem.
>
> Carsten
> _______________________________________________
> Condorligo mailing list
> Condorligo__AT__aei.mpg.de
> http://lists.aei.mpg.de/cgi-bin/mailman/listinfo/condorligo

-- 

Duncan Brown                          Room 263-1, Department of Physics,
Assistant Professor of Physics        Syracuse University, NY 13244, USA
Phone: (315) 443 5993             http://www.gravity.phy.syr.edu/~duncan




===========================================================================
Date mail was appended: Sat Sep 19 11:39:16 2009 (1253378357)
From: Carsten Aulbert <carsten.aulbert__AT__aei.mpg.de>
To: Duncan Brown <dabrown__AT__physics.syr.edu>
Subject: Re: [Atlas_admin] Re: [CondorLIGO] Re: [condor-support #7835]
 LIGO dagman restart bug
Date: Sat, 19 Sep 2009 18:46:51 +0200
CC: David McKechan <dja.mckechan__AT__gmail.com>,        Thomas Dent
 <Thomas.Dent__AT__astro.cf.ac.uk>,
 "condor-support response tracking system" <condor-support__AT__cs.wisc.edu>,
 "Condor/LIGO mailing list" <condorligo__AT__aei.mpg.de>,
 atlas_admin__AT__aei.mpg.de
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu

On Saturday 19 September 2009 18:38:50 Duncan Brown wrote:
> Hi Carsten, Kent,
> 
> Thanks, it looks like it's working.
> 
> Carsten, please can you do the atlas-specific magic that needs to be
> done to get Tom's jobs out of the idle state and into the running state?

It seems I don't have to intervene:

RUN/HELD/IDLE | atlas1         atlas2         atlas3         atlas4        | Total
--------------+------------------------------------------------------------+--------------
ajith         |                0/520/0                                     | 0/520/0
ajw           |                               6/0/0                        | 6/0/0
asad          | 2/0/0          0/0/1                                       | 2/0/1
christian     | 2/0/0                                                      | 2/0/0
dietz         |                292/0/16                                    | 292/0/16
hpletsch      | 734/0/25889                                                | 734/0/25889
jveitch       |                701/293/0                                   | 701/293/0
lppekows      | 0/92/0                                       7/0/53        | 7/92/53
mckechan      | 2/0/2                                                      | 2/0/2
mwas          |                               12/0/106                     | 12/0/106
paleac        |                               1/0/0                        | 1/0/0
pankow        |                               7/0/33                       | 7/0/33
rahul         |                               4/0/2                        | 4/0/2
tdent         | 41/0/96                                                    | 41/0/96
vedovato      |                5/344/0        1/86/0                       | 6/430/0
volodya       |                4903/0/659                                  | 4903/0/659
--------------+------------------------------------------------------------+--------------
Total         | 781/92/25987   5901/1157/676  31/86/141      7/0/53        | 6720/1335/26857


Already 41 of Tom's jobs are running and 96 are waiting for slots to become available.

Carsten

===========================================================================
Date mail was appended: Sat Sep 19 11:47:23 2009 (1253378843)
Date: Tue, 22 Sep 2009 17:46:19 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #7835] LIGO dagman restart bug

Duncan,

I just wanted to confirm that the fix is working on your end and I can 
close this ticket.

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Tue Sep 22 17:46:42 2009 (1253659603)
CC: atlas_admin__AT__aei.mpg.de,
 "Miroslav V.Shaltev" <miroslav.shaltev__AT__shaltev.de>,
 Thomas Dent <Thomas.Dent__AT__astro.cf.ac.uk>,        Condor/LIGO mailing list
 <condorligo__AT__aei.mpg.de>,        David McKechan <dja.mckechan__AT__gmail.com>,
 Ian Harry <ian.harry__AT__astro.cf.ac.uk>
From: Duncan Brown <dabrown__AT__physics.syr.edu>
To: condor-support__AT__cs.wisc.edu
Subject: Re: [CondorLIGO] Re: [condor-support #7835] LIGO dagman restart bug
Date: Fri, 25 Sep 2009 12:01:16 -0400
X-Proofpoint-Virus-Version: vendor=fsecure engine=1.12.8161:2.4.5,1.2.40,4.0.166
 definitions=2009-09-25_09:2009-09-22,2009-09-25,2009-09-25 signatures=0
X-Proofpoint-Spam-Reason: safe
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu

Hi Kent,

Yes, the problem appears to be fixed in this version of dagman, thanks.

Stuart, please can you arrange to have this code installed on all the  
LDG clusters?

BTW, when trying to restart some failed dags after dropping in the new  
dagman, we saw messages like

09/25 17:35:38 Warning: log file for node 6c70c026aa02ed21f1b5eb20a95ac5c0 is already monitored
09/25 17:35:38 Warning: log file for node 1a88973d91b1240c6a85ddbf632c2355 is already monitored
09/25 17:35:38 Warning: log file for node d01f74620d5c57b84c35d8dd9a4548d5 is already monitored

Any idea where these come from, what they mean and how to get rid of  
them?

Cheers,
Duncan.

On Sep 22, 2009, at 6:46 PM, condor-support response tracking system  
wrote:

> Duncan,
>
> I just wanted to confirm that the fix is working on your end and I can
> close this ticket.
>
> Kent Wenger
> Condor Team
>
>
> ========================================
> MESSAGE INFORMATION
> ========================================
> * From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
> * Ticket Email List: dabrown__AT__physics.syr.edu, condorligo__AT__aei.mpg.de,miroslav.shaltev__AT__shaltev.de 
> ,atlas_admin__AT__aei.mpg.de,steffen.grunewald__AT__aei.mpg.de,anderson__AT__ligo.caltech.edu 
> ,carsten.aulbert__AT__aei.mpg.de,Thomas.Dent__AT__astro.cf.ac.uk,dja.mckechan__AT__gmail.com
>
> _______________________________________________
> Condorligo mailing list
> Condorligo__AT__aei.mpg.de
> http://lists.aei.mpg.de/cgi-bin/mailman/listinfo/condorligo

-- 

Duncan Brown                          Room 263-1, Department of Physics,
Assistant Professor of Physics        Syracuse University, NY 13244, USA
Phone: (315) 443 5993             http://www.gravity.phy.syr.edu/~duncan




===========================================================================
Date mail was appended: Fri Sep 25 11:02:27 2009 (1253894547)
Date: Fri, 25 Sep 2009 11:21:08 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #7835] LIGO dagman restart bug

Duncan,

> Yes, the problem appears to be fixed in this version of dagman, thanks.

Okay, great!  I'll resolve the ticket.

> BTW, when trying to restart some failed dags after dropping in the new
> dagman, we saw messages like
>
> 09/25 17:35:38 Warning: log file for node
> 6c70c026aa02ed21f1b5eb20a95ac5c0 is already monitored
> 09/25 17:35:38 Warning: log file for node
> 1a88973d91b1240c6a85ddbf632c2355 is already monitored
> 09/25 17:35:38 Warning: log file for node
> d01f74620d5c57b84c35d8dd9a4548d5 is already monitored
>
> Any idea where these come from, what they mean and how to get rid of
> them?

This is just some information from the new "lazy log files" code.  These
warnings don't actually indicate a problem; they are more for debugging
the code.

It turns out that I have them set at a debug level where getting rid of
them would also turn off a lot of other output.  Maybe I should change the
debug level so that it's easier to get rid of them?

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Fri Sep 25 11:21:42 2009 (1253895702)
Date: Fri, 25 Sep 2009 12:58:09 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #7835] LIGO dagman restart bug

Duncan,

> BTW, when trying to restart some failed dags after dropping in the new
> dagman, we saw messages like
>
> 09/25 17:35:38 Warning: log file for node
> 6c70c026aa02ed21f1b5eb20a95ac5c0 is already monitored
> 09/25 17:35:38 Warning: log file for node
> 1a88973d91b1240c6a85ddbf632c2355 is already monitored
> 09/25 17:35:38 Warning: log file for node
> d01f74620d5c57b84c35d8dd9a4548d5 is already monitored
>
> Any idea where these come from, what they mean and how to get rid of
> them?

I've changed the code so that these messages aren't printed at the default 
debug setting.  (You'll see this change in the "real" 7.4.0.)

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Fri Sep 25 12:58:33 2009 (1253901514)
CC: Condor/LIGO mailing list <condorligo__AT__aei.mpg.de>
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
Subject: Re: [condor-support #7835] LIGO dagman restart bug
Date: Fri, 25 Sep 2009 12:03:11 -0700
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu

As discussed in the Condor/LIGO call today, LIGO will deploy a set of
7.4.0 pre-release DAGMan binaries that include this fix as well as the
fix for ticket #19456 which was approved for inclusion in 7.4.0.

The expectation is that these will be available on Monday.

Thanks.

On Sep 25, 2009, at 9:02 AM, condor-support response tracking system  
wrote:

> Hi Kent,
>
> Yes, the problem appears to be fixed in this version of dagman,  
> thanks.
>
> Stuart, please can you arrange to have this code installed on all the
> LDG clusters?
>
> BTW, when trying to restart some failed dags after dropping in the new
> dagman, we saw messages like
>
> 09/25 17:35:38 Warning: log file for node
> 6c70c026aa02ed21f1b5eb20a95ac5c0 is already monitored
> 09/25 17:35:38 Warning: log file for node
> 1a88973d91b1240c6a85ddbf632c2355 is already monitored
> 09/25 17:35:38 Warning: log file for node
> d01f74620d5c57b84c35d8dd9a4548d5 is already monitored
>
> Any idea where these come from, what they mean and how to get rid of
> them?
>
> Cheers,
> Duncan.
>
> On Sep 22, 2009, at 6:46 PM, condor-support response tracking system
> wrote:
>
>> Duncan,
>>
>> I just wanted to confirm that the fix is working on your end and I  
>> can
>> close this ticket.
>>
>> Kent Wenger
>> Condor Team
>>

--
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson




===========================================================================
Date mail was appended: Fri Sep 25 14:03:29 2009 (1253905411)
Date: Mon, 28 Sep 2009 12:00:02 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #7835] LIGO dagman restart bug

Duncan,

I have new DAGMan binaries available at the following URL:
ftp://ftp.cs.wisc.edu/condor/temporary/forligo/dagman-7.4.0-prerelease-2009-09-26/

These binaries have the following 7.4.0 improvements:

* The condor-support #7835/gittrac #744 problem (DAGMan errors out on
abnormal job termination in recovery mode) is fixed.

* The condor-admin #19456/gittrac #622 problem (invalid rescue DAG generated
because of event reading problems in recovery mode) is fixed.

* The "Warning: log file for node xxx already monitored" messages are
not printed at the default verbosity.

They also have all improvements from 7.3.2, such as the "lazy log file
evaluation" and the automatic default log file for node jobs that don't
specify a log file.  You may also want to explore setting the new
DAGMAN_USER_LOG_SCAN_INTERVAL configuration variable to a smaller value
than its default of 5 seconds, for DAGs that have many very small jobs.
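
A minimal sketch of what such a setting might look like in a Condor
configuration file. The value of 1 second is purely illustrative and is an
assumption, not a recommendation from this ticket; placing it in a local
configuration file on the submit host is likewise an assumption:

```
# Illustrative only: scan node job user logs more often than the
# 5-second default mentioned above. The value 1 is an assumed example.
DAGMAN_USER_LOG_SCAN_INTERVAL = 1
```

A shorter interval should let DAGMan notice finished jobs sooner, at the
cost of reading the log files more often.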

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Mon Sep 28 12:00:44 2009 (1254157245)
Subject: Actions

Ticket resolved by wenger
===========================================================================
Date of actions: Mon Sep 28 12:01:43 2009 (1254157303)
Subject: Actions

Ticket was reopened by mailnull
===========================================================================
Date of actions: Wed Sep 30 16:23:24 2009 (1254345804)
CC: dabrown__AT__physics.syr.edu, condorligo__AT__aei.mpg.de,
 miroslav.shaltev__AT__shaltev.de, atlas_admin__AT__aei.mpg.de,
 steffen.grunewald__AT__aei.mpg.de, carsten.aulbert__AT__aei.mpg.de,
 Thomas.Dent__AT__astro.cf.ac.uk, dja.mckechan__AT__gmail.com,
 ian.harry__AT__astro.cf.ac.uk
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
Subject: Re: [condor-support #7835] LIGO dagman restart bug
Date: Wed, 30 Sep 2009 14:22:55 -0700
X-Seen-BY: mailfromd 4.1 granite.cs.wisc.edu

These binaries have been installed on one of the head nodes at LDAS-CIT
for testing. I would appreciate it if the same Atlas nodes that tested a
previous pre-release were updated to this version for re-testing as well,
before we deploy these more widely on the LDG.

Thanks.

On Sep 28, 2009, at 10:00 AM, condor-support response tracking system  
wrote:

> Duncan,
>
> I have new DAGMan binaries available at the following URL:
> ftp://ftp.cs.wisc.edu/condor/temporary/forligo/dagman-7.4.0-prerelease-2009-09-26/
>
> These binaries have the following 7.4.0 improvements:
>
> * The condor-support #7835/gittrac #744 problem (DAGMan errors out on
> abnormal job termination in recovery mode) is fixed.
>
> * The condor-admin #19456/gittrac #622 problem (invalid rescue DAG  
> generated
> because of event reading problems in recovery mode) is fixed.
>
> * The "Warning: log file for node xxx already monitored" messages are
> not printed at the default verbosity.
>
> They also have all improvements from 7.3.2, such as the "lazy log file
> evaluation" and the automatic default log file for node jobs that  
> don't
> specify a log file.  You may also want to explore setting the new
> DAGMAN_USER_LOG_SCAN_INTERVAL configuration variable to a smaller  
> value
> than its default of 5 seconds, for DAGs that have many very small  
> jobs.
>
> Kent Wenger
> Condor Team
>
>
> ========================================
> MESSAGE INFORMATION
> ========================================
> * From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
> * Ticket Email List: dabrown__AT__physics.syr.edu, condorligo__AT__aei.mpg.de,miroslav.shaltev__AT__shaltev.de 
> ,atlas_admin__AT__aei.mpg.de,steffen.grunewald__AT__aei.mpg.de,anderson__AT__ligo.caltech.edu 
> ,carsten.aulbert__AT__aei.mpg.de,Thomas.Dent__AT__astro.cf.ac.uk,dja.mckechan__AT__gmail.com 
> ,ian.harry__AT__astro.cf.ac.uk
>

--
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson




===========================================================================
Date mail was appended: Wed Sep 30 16:23:24 2009 (1254345804)
From: "Miroslav V.Shaltev" <miroslav.shaltev__AT__aei.mpg.de>
To: Stuart Anderson <anderson__AT__ligo.caltech.edu>
Subject: Re: [condor-support #7835] LIGO dagman restart bug
Date: Thu, 1 Oct 2009 08:40:42 +0200
CC: condor-support__AT__cs.wisc.edu, dabrown__AT__physics.syr.edu,
 condorligo__AT__aei.mpg.de,        atlas_admin__AT__aei.mpg.de,
 steffen.grunewald__AT__aei.mpg.de,        carsten.aulbert__AT__aei.mpg.de,
 Thomas.Dent__AT__astro.cf.ac.uk,        dja.mckechan__AT__gmail.com,
 ian.harry__AT__astro.cf.ac.uk
X-Mimetrack: Itemize by SMTP Server on intranet/aei-hannover(Release
 8.5HF224 | March 30, 2009) at 10/01/2009 08:40:18,	Serialize by Router on
 intranet/aei-hannover(Release 8.5HF224 | March 30, 2009) at 10/01/2009
 08:40:28,	Serialize complete at 10/01/2009 08:40:28
X-PMX-Version: 5.5.8.383112, Antispam-Engine: 2.7.2.376379, Antispam-Data:
 2009.10.1.62719
X-Perlmx-Spam: Gauge=IIIIIIII, Probability=8%, Report=' BODY_SIZE_2000_2999
 0, BODY_SIZE_5000_LESS 0, BODY_SIZE_7000_LESS 0, __BOUNCE_CHALLENGE_SUBJ
 0, __CD 0, __CP_URI_IN_BODY 0, __CT 0, __CTE 0, __CT_TEXT_PLAIN 0,
 __FRAUD_419_CONTACT_NUM 0, __HAS_MSGID 0, __MIME_TEXT_ONLY 0,
 __MIME_VERSION 0, __SANE_MSGID 0, __TO_MALFORMED_2 0, __URI_NS ,
 __USER_AGENT 0'
X-Seen-BY: mailfromd 4.1 granite.cs.wisc.edu

hi all,

atlas1 and atlas3 are up to date with the new pre-release of condor_*dag*.

cheers,
miroslav

On Wednesday 30 September 2009, Stuart Anderson wrote:
> These binaries have been installed on one of the head nodes at LDAS-CIT
> for testing. I would appreciate it if the same Atlas nodes that tested a
> previous pre-release were updated to this version for re-testing as well,
> before we deploy these more widely on the LDG.
>
> Thanks.
>
> On Sep 28, 2009, at 10:00 AM, condor-support response tracking system
>
> wrote:
> > Duncan,
> >
> > I have new DAGMan binaries available at the following URL:
> > ftp://ftp.cs.wisc.edu/condor/temporary/forligo/dagman-7.4.0-prerelease-2009-09-26/
> >
> > These binaries have the following 7.4.0 improvements:
> >
> > * The condor-support #7835/gittrac #744 problem (DAGMan errors out on
> > abnormal job termination in recovery mode) is fixed.
> >
> > * The condor-admin #19456/gittrac #622 problem (invalid rescue DAG
> > generated
> > because of event reading problems in recovery mode) is fixed.
> >
> > * The "Warning: log file for node xxx already monitored" messages are
> > not printed at the default verbosity.
> >
> > They also have all improvements from 7.3.2, such as the "lazy log file
> > evaluation" and the automatic default log file for node jobs that
> > don't
> > specify a log file.  You may also want to explore setting the new
> > DAGMAN_USER_LOG_SCAN_INTERVAL configuration variable to a smaller
> > value
> > than its default of 5 seconds, for DAGs that have many very small
> > jobs.
> >
> > Kent Wenger
> > Condor Team
> >
> >
> > ========================================
> > MESSAGE INFORMATION
> > ========================================
> > * From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
> > * Ticket Email List: dabrown__AT__physics.syr.edu,
> > condorligo__AT__aei.mpg.de,miroslav.shaltev__AT__shaltev.de
> > ,atlas_admin__AT__aei.mpg.de,steffen.grunewald__AT__aei.mpg.de,anderson__AT__ligo.caltech.edu
> > ,carsten.aulbert__AT__aei.mpg.de,Thomas.Dent__AT__astro.cf.ac.uk,dja.mckechan__AT__gmail.com
> > ,ian.harry__AT__astro.cf.ac.uk
>
> --
> Stuart Anderson  anderson__AT__ligo.caltech.edu
> http://www.ligo.caltech.edu/~anderson


-- 
Miroslav Shaltev
Albert Einstein Institute
Callinstr 38
D-30167 Hannover, Germany

Phone: +49-(0)511-762-17103 (room 033)


===========================================================================
Date mail was appended: Thu Oct  1  2:06:03 2009 (1254380764)
Subject: Actions

Ticket resolved by tannenba
===========================================================================
Date of actions: Fri Oct 23 13:08:18 2009 (1256321298)