LIGO Support Ticket 19609

Ticket Information
  Number:      admin 19609
  User:        skoranda@gravity.phys.uwm.edu
  Email:       condorligo__AT__aei.mpg.de,anderson__AT__ligo.caltech.edu,dabrown__AT__physics.syr.edu,rosso__AT__gravity.phys.uwm.edu,carsten.aulbert__AT__aei.mpg.de
  Status:      resolved
  Assigned To: gthain
Date: Tue, 18 Aug 2009 10:24:58 -0500
From: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
To: condor-admin__AT__cs.wisc.edu
CC: condorligo__AT__aei.mpg.de, Stuart Anderson <anderson__AT__ligo.caltech.edu>,
 Duncan Brown <dabrown__AT__physics.syr.edu>, Ross Oldenburg
 <rosso__AT__gravity.phys.uwm.edu>
Subject: LIGO: mangled paths with autofs and condor
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu

Hi,

I am forwarding this to condor-admin with the subject line
"LIGO" to officially create a LIGO support ticket.

I am also asking officially that this become the highest
priority LIGO ticket at this time. It is seriously impacting a
high profile science analysis and users are leaving our
cluster to compute elsewhere.

I would appreciate any effort the Condor team can put into it.
Please let us know what you need from us.

Thanks,

Scott

----- Forwarded message from Ross Oldenburg <rosso__AT__gravity.phys.uwm.edu> -----

Date: Thu, 13 Aug 2009 16:43:43 -0500
From: Ross Oldenburg <rosso__AT__gravity.phys.uwm.edu>
To: ligosysadm__AT__gravity.psu.edu
Cc: condor-users-request__AT__cs.wisc.edu, condorligo__AT__aei.mpg.de
Subject: [CondorLIGO] mangled paths with autofs and condor

Hello,

We have been seeing a strange problem the past several weeks that looks  
like an issue with the Linux NFS client, especially with respect to  
automounted filesystems.  In summary, what happens is that if an NFS  
server crashes or autofs hiccups or restarts, Condor mangles absolute  
paths, dropping the mount point from the head of that path.  For example, 
if my Condor job is working out of /mnt/nfs1/data/somejob and nfs1 crashes 
or autofs is restarted on the execute machine, instead of recovering or 
failing cleanly, Condor starts to think it's working in /data/somejob, 
dropping /mnt/nfs1 entirely.  Examples from DAGMan logfiles follow:

Working directory:  
/home/jclayton/S6/weeklyruns/lowmass/highthreshweek1/931035296-931564887/full_data
dagman.out snippet:
...
8/7 21:38:41 Submitting Condor Node b12e16d65e95e43d0b5e8a63b79fb222
job(s)...
8/7 21:38:41 submitting: condor_submit -a dag_node_name' '='
'b12e16d65e95e43d0b5e8a63b79fb222 -a +DAGManJobId' '=' '2748980 -a
DAGManJobId' '=' '2748980 -a submit_event_notes' '=' 'DAG' 'Node:'
'b12e16d65e95e43d0b5e8a63b79fb222 -a macronumslides' '=' '50 -a
macroh1triggers' '=' ' -a macroifotag' '=' 'SECOND_H1V1 -a
macrogpsstarttime' '=' '931174528 -a macrogpsendtime' '=' '931175496 -a
macrov1triggers' '=' ' -a macroarguments' '='
'H1-INSPIRAL_SECOND_H1V1_FULL_DATA-931174464-2048.xml.gz'
'H1-INSPIRAL_SECOND_H1V1_FULL_DATA-931174952-2048.xml.gz'
'V1-INSPIRAL_SECOND_H1V1_FULL_DATA-931173977-2048.xml.gz -a
+DAGParentNodeNames' '='
'"4579823e472622649f77bc6d00289d48,34623f843b05aa5b464047e38216a7e4,0f8342bf051b45f35ac57dd586d4cabf"
inspiral_hipe_full_data.thinca2_slides_H1V1.FULL_DATA.sub
8/7 21:38:41 From submit: Submitting job(s)
8/7 21:38:41 From submit: ERROR: No such directory:
/S6/weeklyruns/lowmass/highthreshweek1/931035296-931564887/full_data
8/7 21:38:41 failed while reading from pipe.
8/7 21:38:41 Read so far: Submitting job(s)ERROR: No such directory:
/S6/weeklyruns/lowmass/highthreshweek1/931035296-931564887/full_data
8/7 21:38:41 ERROR: submit attempt failed
8/7 21:38:41 submit command was: condor_submit -a dag_node_name' '='
'b12e16d65e95e43d0b5e8a63b79fb222 -a +DAGManJobId' '=' '2748980 -a
DAGManJobId' '=' '2748980 -a submit_event_notes' '=' 'DAG' 'Node:'
'b12e16d65e95e43d0b5e8a63b79fb222 -a macronumslides' '=' '50 -a
macroh1triggers' '=' ' -a macroifotag' '=' 'SECOND_H1V1 -a
macrogpsstarttime' '=' '931174528 -a macrogpsendtime' '=' '931175496 -a
macrov1triggers' '=' ' -a macroarguments' '='
'H1-INSPIRAL_SECOND_H1V1_FULL_DATA-931174464-2048.xml.gz'
'H1-INSPIRAL_SECOND_H1V1_FULL_DATA-931174952-2048.xml.gz'
'V1-INSPIRAL_SECOND_H1V1_FULL_DATA-931173977-2048.xml.gz -a
+DAGParentNodeNames' '='
'"4579823e472622649f77bc6d00289d48,34623f843b05aa5b464047e38216a7e4,0f8342bf051b45f35ac57dd586d4cabf"
inspiral_hipe_full_data.thinca2_slides_H1V1.FULL_DATA.sub
8/7 21:38:41 Could not change to original directory: Unable to chdir to
S6/weeklyruns/lowmass/highthreshweek1/931035296-931564887/full_data: No
such file or directory
8/7 21:38:41 Job submit try 1/6 failed, will try again in >= 1 second.
...

This continues (six tries) until the node fails.

Working directory:  
/home/larry/S6_weekly/lowmass/week3/932169543-932774487/bbhinj
dagman.out snippet:
...
8/7 10:28:58 Submitting Condor Node 27d612a286265b135f502bb68ae6628b  
job(s)...
8/7 10:28:58 submitting: condor_submit -a dag_node_name' '='  
'27d612a286265b135f502bb68ae6628b -a +DAGManJobId' '=' '6419231 -a  
DAGManJobId' '=' '6419231 -a submit_event_notes' '=' 'DAG' 'Node:'  
'27d612a286265b135f502bb68ae6628b -a macrov1triggers' '=' ' -a  
macrogpsendtime' '=' '932670135 -a macroh1triggers' '=' ' -a  
macrogpsstarttime' '=' '932666971 -a macroifotag' '=' 'SECOND_H1V1 -a  
macroarguments' '=' 'H1-INSPIRAL_SECOND_H1V1_BBHINJ-932666907-2048.xml.gz'  
'H1-INSPIRAL_SECOND_H1V1_BBHINJ-932668827-2048.xml.gz'  
'V1-INSPIRAL_SECOND_H1V1_BBHINJ-932665260-2048.xml.gz'  
'V1-INSPIRAL_SECOND_H1V1_BBHINJ-932667180-2048.xml.gz'  
'V1-INSPIRAL_SECOND_H1V1_BBHINJ-932669100-2048.xml.gz -a  
+DAGParentNodeNames' '='  
'"bbbce8b5146116c109b02fc767948090,5f02925d7627146ec0d429a96ec52a92,e973da9d3acda11972dfb0c81980e1e9,5ee70c2feb48e2083aa387302f7e8357,f568cd86fe7e641bd640abdb41649e32" 
inspiral_hipe_bbhinj.thinca2_H1V1.BBHINJ.sub
8/7 10:28:58 From submit: Submitting job(s)
8/7 10:28:58 From submit: ERROR: No such directory:  
/S6_weekly/lowmass/week3/932169543-932774487/bbhinj
8/7 10:28:58 failed while reading from pipe.
8/7 10:28:58 Read so far: Submitting job(s)ERROR: No such directory:  
/S6_weekly/lowmass/week3/932169543-932774487/bbhinj
8/7 10:28:58 ERROR: submit attempt failed
8/7 10:28:58 submit command was: condor_submit -a dag_node_name' '='  
'27d612a286265b135f502bb68ae6628b -a +DAGManJobId' '=' '6419231 -a  
DAGManJobId' '=' '6419231 -a submit_event_notes' '=' 'DAG' 'Node:'  
'27d612a286265b135f502bb68ae6628b -a macrov1triggers' '=' ' -a  
macrogpsendtime' '=' '932670135 -a macroh1triggers' '=' ' -a  
macrogpsstarttime' '=' '932666971 -a macroifotag' '=' 'SECOND_H1V1 -a  
macroarguments' '=' 'H1-INSPIRAL_SECOND_H1V1_BBHINJ-932666907-2048.xml.gz'  
'H1-INSPIRAL_SECOND_H1V1_BBHINJ-932668827-2048.xml.gz'  
'V1-INSPIRAL_SECOND_H1V1_BBHINJ-932665260-2048.xml.gz'  
'V1-INSPIRAL_SECOND_H1V1_BBHINJ-932667180-2048.xml.gz'  
'V1-INSPIRAL_SECOND_H1V1_BBHINJ-932669100-2048.xml.gz -a  
+DAGParentNodeNames' '='  
'"bbbce8b5146116c109b02fc767948090,5f02925d7627146ec0d429a96ec52a92,e973da9d3acda11972dfb0c81980e1e9,5ee70c2feb48e2083aa387302f7e8357,f568cd86fe7e641bd640abdb41649e32" 
inspiral_hipe_bbhinj.thinca2_H1V1.BBHINJ.sub
8/7 10:28:58 Could not change to original directory: Unable to chdir to  
S6_weekly/lowmass/week3/932169543-932774487/bbhinj: No such file or  
directory
8/7 10:28:58 Job submit try 1/6 failed, will try again in >= 1 second.
...

Again, Condor retries six times and fails the job.

These single job failures cause the entire DAG to bail out and write a  
rescue file.  When submitted, the rescue DAG works normally provided there 
are no more NFS hiccups.
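
The truncation in the two snippets above can be checked mechanically. What
follows is only a minimal sketch in Python, not part of Condor or DAGMan: the
helper name dropped_prefix is made up for illustration, and it simply compares
the DAG's working directory with the path reported in the "No such directory"
error to see which leading components went missing.

```python
def dropped_prefix(expected, reported):
    """Return the leading part of `expected` that is missing from `reported`.

    Returns None when `reported` is not a truncated form of `expected`
    (for example when the two paths are identical or unrelated).
    """
    exp_parts = [p for p in expected.split("/") if p]
    rep_parts = [p for p in reported.split("/") if p]
    # A truncated path must be non-empty and strictly shorter.
    if not rep_parts or len(rep_parts) >= len(exp_parts):
        return None
    # The reported path must match the tail of the expected path.
    if exp_parts[-len(rep_parts):] != rep_parts:
        return None
    missing = exp_parts[:len(exp_parts) - len(rep_parts)]
    return "/" + "/".join(missing)


if __name__ == "__main__":
    # Paths copied from the first dagman.out snippet above.
    workdir = ("/home/jclayton/S6/weeklyruns/lowmass/highthreshweek1/"
               "931035296-931564887/full_data")
    error_path = ("/S6/weeklyruns/lowmass/highthreshweek1/"
                  "931035296-931564887/full_data")
    print(dropped_prefix(workdir, error_path))
```

For both failures shown here the missing part is the user's home directory
(/home/jclayton and /home/larry respectively).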

The reason I think this is an issue with the Linux NFS client is that we
did not see this problem until the last kernel upgrade.  We're running
CentOS 5.3 (only tracking security updates) and Condor 7.2.4.  Here is
some more detailed information:

[root@marlin ~]# condor_version
$CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $
$CondorPlatform: X86_64-LINUX_RHEL5 $
[root@marlin ~]# condor_dagman -v
$CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $
$CondorPlatform: X86_64-LINUX_RHEL5 $
[root@marlin ~]# uname -a
Linux marlin.phys.uwm.edu 2.6.18-128.1.14.el5 #1 SMP Wed Jun 17 06:38:05  
EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
[root@marlin ~]# rpm -q glibc
glibc-2.5-24  [x86_64]
glibc-2.5-34  [x86_64]
glibc-2.5-34  [i386]

Has anybody else seen this problem or something similar?  If so, what is
the workaround?  Is there more information I can provide to help track
down this problem, or somewhere else I should look?

Thanks for your assistance,
Ross Oldenburg
UWMLSC Sysadmin


_______________________________________________
Condorligo mailing list
Condorligo__AT__aei.mpg.de
http://lists.aei.mpg.de/cgi-bin/mailman/listinfo/condorligo

----- End forwarded message -----

===========================================================================
Date of creation: Tue Aug 18 10:25:08 2009 (1250609110)
Subject: Actions

Assigned to gthain by gthain
===========================================================================
Date of actions: Tue Aug 18 12:11:29 2009 (1250615489)
Date: Tue, 18 Aug 2009 12:14:02 -0500
From: Greg Thain <gthain__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19609] LIGO: mangled paths with autofs and condor

Scott:

Have you tried the "initialdir" change that Todd suggested last week?  
Also, if we could see a submit file, that would help.  How often does 
this happen?

Thanks,

-Greg


===========================================================================
Date mail was appended: Tue Aug 18 12:14:08 2009 (1250615649)
Date: Tue, 18 Aug 2009 12:19:08 -0500
From: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: condorligo__AT__aei.mpg.de, anderson__AT__ligo.caltech.edu,
 dabrown__AT__physics.syr.edu, rosso__AT__gravity.phys.uwm.edu
Subject: Re: [condor-admin #19609] LIGO: mangled paths with autofs and condor
X-Seen-BY: mailfromd 4.1 granite.cs.wisc.edu

> Scott:
> 
> Have you tried the "initialdir" change that Todd suggested last week?  

No. Can somebody give me a quick summary so that I can get the
info into the hands of the users?

> Also, if we could see a submit file, that would help.  

I will get one to you right after the scientists come back
from lunch. 

> How often does 
> this happen?

It does not happen with every job, or every DAG, but it is
repeatable enough that users feel they cannot make progress
with work.

Thanks,

Scott


===========================================================================
Date mail was appended: Tue Aug 18 12:19:18 2009 (1250615959)
Date: Tue, 18 Aug 2009 12:51:09 -0500
From: Ross Oldenburg <rosso__AT__gravity.phys.uwm.edu>
To: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
CC: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>,
 condorligo__AT__aei.mpg.de, anderson__AT__ligo.caltech.edu, dabrown__AT__physics.syr.edu
Subject: Re: [condor-admin #19609] LIGO: mangled paths with autofs and condor
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu

Hi Scott, Greg,

Scott Koranda wrote:
>> Scott:
>>
>> Have you tried the "initialdir" change that Todd suggested last week?  
>>     
>
> No. Can somebody give me a quick summary so that I can get the
> info into the hands of the users?
>   
Todd recommended explicitly specifying 
initialdir='/this/is/my/job/directory' instead of relying on the current 
working directory when generating submit files.  If autofs (or NFS in 
general) hiccups, Condor will see a NULL for the automounted part of Cwd 
and will chop off the front of the path.  The theory is that autofs 
mount points are actually symlinks to something out in kernel land.  
Condor attempts to resolve all symlinks into absolute paths.  So if 
condor_submit is attempting to resolve a path while autofs is being 
reloaded/restarted or an NFS server crashes (and hence can't be 
automounted), we end up with NULL as a mount point and get 
'/is/my/job/directory' instead of '/this/is/my/job/directory'.  I hope 
this helps clarify what we think is happening.
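
To make the suggestion concrete, here is a minimal sketch in Python of
generating a submit description that sets initialdir explicitly, so that
condor_submit does not have to work out the current working directory. The
function name render_submit, the vanilla universe and the executable are
assumptions for illustration only; they are not taken from the actual LIGO
submit files, and the directory is the placeholder used above.

```python
import os


def render_submit(initialdir, executable, arguments=""):
    """Return submit-description text that pins the job directory.

    initialdir must be absolute: the point is to avoid any dependence
    on the current working directory at submit time.
    """
    if not os.path.isabs(initialdir):
        raise ValueError("initialdir must be an absolute path: %r" % initialdir)
    lines = [
        "universe = vanilla",
        "executable = %s" % executable,
        "arguments = %s" % arguments,
        "initialdir = %s" % initialdir,
        "queue",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical executable; the directory is the placeholder from the text.
    print(render_submit("/this/is/my/job/directory", "/bin/hostname"))
```

Rejecting relative paths is deliberate: a relative initialdir would again be
interpreted against whatever directory condor_submit believes it is in.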

>   
>> Also, if we could see a submit file, that would help.  
>>     
>
> I will get one to you right after the scientists come back
> from lunch. 
>
>   
>> How often does 
>> this happen?
>>     
>
> It does not happen with every job, or every DAG, but it is
> repeatable enough that users feel they cannot make progress
> with work.
>   

It happens whenever any NFS filesystem that a job depends on gets stuck 
or momentarily disappears for some reason (a server crash, the locker 
gets screwed up, autofs restart, etc.)
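
Since the trigger is a filesystem that is briefly unavailable, one possible
stopgap is to confirm that the job directory is reachable before submitting
and to retry for a short while if it is not. The Python sketch below is an
assumption on my part, not something Condor provides and not a tested fix; the
function name and the retry settings are made up for illustration. Note also
that on a hard-mounted NFS filesystem the directory check itself may block
while the server is down.

```python
import os
import time


def wait_for_directory(path, attempts=5, delay=1.0, sleep=time.sleep):
    """Return True once `path` is an existing directory, False if it never is.

    Checks up to `attempts` times, pausing `delay` seconds between checks.
    `sleep` is injectable so the function can be exercised without waiting.
    """
    for attempt in range(attempts):
        if os.path.isdir(path):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


if __name__ == "__main__":
    # Placeholder job directory from the explanation above.
    job_dir = "/this/is/my/job/directory"
    if wait_for_directory(job_dir):
        print("directory is reachable, safe to submit")
    else:
        print("directory still unavailable, not submitting")
```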

If there is any more information I can provide or any suggestions for 
tests I can run, please let me know.

--Ross

> Thanks,
>
> Scott
>
>   
>> Thanks,
>>
>> -Greg
>>     


===========================================================================
Date mail was appended: Tue Aug 18 12:51:15 2009 (1250617876)
Date: Tue, 18 Aug 2009 13:30:24 -0500
From: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: condorligo__AT__aei.mpg.de, anderson__AT__ligo.caltech.edu,
 dabrown__AT__physics.syr.edu,        rosso__AT__gravity.phys.uwm.edu
Subject: Re: [condor-admin #19609] LIGO: mangled paths with autofs and condor
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu

Hi,

> Scott:
> 
> Have you tried the "initialdir" change that Todd suggested last week?  
> Also, if we could see a submit file, that would help.  

Please download

http://www.lsc-group.phys.uwm.edu/~skoranda/LIGO-Condor-ticket-19609-01.tar.gz

In that tarball you will find 

inspiral_hipe_full_data.FULL_DATA.dag.dagman.out

That DAGman output file shows the errors, for example

8/7 21:38:41 Could not change to original directory: Unable to chdir to S6/weeklyruns/lowmass/highthreshweek1/931035296-931564887/full_data: No such file or directory

Also in that tarball you will find all 89 submit files for
that DAG.

Please let me know if you need anything else.

Thanks,

Scott



===========================================================================
Date mail was appended: Tue Aug 18 13:30:35 2009 (1250620236)
Date: Tue, 18 Aug 2009 15:02:37 -0500
From: Todd Tannenbaum <tannenba__AT__cs.wisc.edu>
To: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
CC: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>,
 condorligo__AT__aei.mpg.de, dabrown__AT__physics.syr.edu
Subject: Re: [CondorLIGO] Re: [condor-admin #19609] LIGO: mangled paths
 with autofs and condor

Scott Koranda wrote:
> Hi,
> 
>> Scott:
>>
>> Have you tried the "initialdir" change that Todd suggested last week?  

I still think the initialdir suggestion will help.

But realize this is really just working around a Linux bug. Seems like 
Red Hat has seen this exact same problem and supposedly has a fix.

Take a look at
   https://bugzilla.redhat.com/show_bug.cgi?id=452122

Deja vu, eh?  :)

The last two comments are :

"This issue has been fixed in the latest autofs package 
autofs-5.0.1-0.rc2.125."

and

"RHEL 5.4 Beta has been released! There should be a fix present in the 
Beta release that addresses this particular request."

regards,
Todd


>> Also, if we could see a submit file, that would help.  
> 
> Please download
> 
> http://www.lsc-group.phys.uwm.edu/~skoranda/LIGO-Condor-ticket-19609-01.tar.gz
> 
> In that tarball you will find 
> 
> inspiral_hipe_full_data.FULL_DATA.dag.dagman.out
> 
> That DAGman output file shows the errors, for example
> 
> 8/7 21:38:41 Could not change to original directory: Unable to chdir to S6/weeklyruns/lowmass/highthreshweek1/931035296-931564887/full_data: No such file or directory
> 
> Also in that tarball you will find all 89 submit files for
> that DAG.
> 
> Please let me know if you need anything else.
> 
> Thanks,
> 
> Scott
> 
> 
> _______________________________________________
> Condorligo mailing list
> Condorligo__AT__aei.mpg.de
> http://lists.aei.mpg.de/cgi-bin/mailman/listinfo/condorligo


-- 
Todd Tannenbaum                       University of Wisconsin-Madison
Condor Project Research               Department of Computer Sciences
tannenba__AT__cs.wisc.edu                  1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                 Madison, WI 53706-1685

===========================================================================
Date mail was appended: Tue Aug 18 15:02:55 2009 (1250625776)
Date: Tue, 18 Aug 2009 15:07:16 -0500
From: Greg Thain <gthain__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19609] LIGO: mangled paths with autofs and condor

condor-admin response tracking system wrote:
> Hi,
>
>   
>> Scott:
>>
>> Have you tried the "initialdir" change that Todd suggested last week?  
>> Also, if we could see a submit file, that would help.  
>>     
>
> Please download
>
> http://www.lsc-group.phys.uwm.edu/~skoranda/LIGO-Condor-ticket-19609-01.tar.gz
>   

Thanks, Scott, this was really helpful.  When condor_submit runs, if 
initialdir isn't set in the submit file, it calls getcwd to populate the 
IWD.  This getcwd seems to be failing because of automount issues.  See 
https://bugzilla.redhat.com/show_bug.cgi?id=452122

If the initialdir option is set in the submit file, then Condor doesn't 
call getcwd.  This would be the best way to work around the problem.
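
A minimal sketch of how the workaround could be applied to submit files that have already been generated, assuming the correct absolute directory is known ahead of time (for example, recorded when the DAG was built). The helper add_initialdir and the sample submit text are hypothetical and are not part of Condor or of the LIGO scripts; the sketch also assumes each submit file has an ordinary "queue" line.

```python
def add_initialdir(submit_text, directory):
    """Return submit_text with 'initialdir = <directory>' placed before
    the first queue statement, unless initialdir is already set.

    Hypothetical helper for illustration only.
    """
    lines = submit_text.splitlines()

    # Leave the file alone if it already sets initialdir.
    for line in lines:
        if line.strip().lower().startswith("initialdir"):
            return submit_text

    out = []
    inserted = False
    for line in lines:
        words = line.strip().lower().split()
        if not inserted and words and words[0] == "queue":
            out.append("initialdir = " + directory)
            inserted = True
        out.append(line)

    # No queue statement found: append the setting at the end.
    if not inserted:
        out.append("initialdir = " + directory)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = "universe = vanilla\nexecutable = /bin/true\nqueue 1\n"
    print(add_initialdir(sample, "/home/user/run"))
```

The directory has to come from somewhere other than a getcwd call made at submit time, otherwise the same failure could simply be written into the file.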

-Greg


===========================================================================
Date mail was appended: Tue Aug 18 15:07:21 2009 (1250626041)
Date: Tue, 18 Aug 2009 15:22:48 -0500
From: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: condorligo__AT__aei.mpg.de, anderson__AT__ligo.caltech.edu,
 dabrown__AT__physics.syr.edu,        rosso__AT__gravity.phys.uwm.edu
Subject: Re: [condor-admin #19609] LIGO: mangled paths with autofs and condor
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu

> Scott Koranda wrote:
> > Hi,
> > 
> >> Scott:
> >>
> >> Have you tried the "initialdir" change that Todd suggested last week?  
> 
> I still think initialdir suggestion will help.

Thanks. Unfortunately it is not simple to change the scripts
that generate the DAG/submit files to add an appropriate value
for 'initialdir'.

> 
> But realize this is really just working around a Linux bug. Seems like 
> Red Hat has seen this exact same problem and supposedly has a fix.
> 
> Take a look at
>    https://bugzilla.redhat.com/show_bug.cgi?id=452122
> 
> Deja vu, eh?  :)
> 
> The last two comments are :
> 
> "This issue has been fixed in the latest autofs package 
> autofs-5.0.1-0.rc2.125."
> 
> and
> 
> "RHEL 5.4 Beta has been released! There should be a fix present in the 
> Beta release that addresses this particular request."
> 
> regards,
> Todd
> 

Looking at the latest CentOS SRPM at

http://mirror.anl.gov/pub/centos/5.3/updates/SRPMS/autofs-5.0.1-0.rc2.102.el5_3.1.src.rpm

I don't see any comment about that bug. So I don't think we
have access to the supposed fix yet.

I am probably going to suggest that we stop using automount
and just mount the NFS partitions directly.

Thanks,

Scott

===========================================================================
Date mail was appended: Tue Aug 18 15:22:56 2009 (1250626977)
Subject: Actions

Ticket resolved by gthain
===========================================================================
Date of actions: Tue Aug 18 15:28:55 2009 (1250627335)
Subject: Actions

Ticket was reopened by mailnull
===========================================================================
Date of actions: Tue Aug 18 16:12:07 2009 (1250629927)
Date: Tue, 18 Aug 2009 16:10:43 -0500
From: Todd Tannenbaum <tannenba__AT__cs.wisc.edu>
To: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
CC: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>,
 condorligo__AT__aei.mpg.de, dabrown__AT__physics.syr.edu
Subject: Re: [CondorLIGO] Re: [condor-admin #19609] LIGO: mangled paths
 with autofs and condor

Scott Koranda wrote:
> Looking at the latest CentOS SRPM at
> 
> http://mirror.anl.gov/pub/centos/5.3/updates/SRPMS/autofs-5.0.1-0.rc2.102.el5_3.1.src.rpm
> 
> I don't see any comment about that bug. So I don't think we
> have access to the supposed fix yet.
> 

The above is rev 102 of autofs.  You need rev 125 or above, or you could 
use something newer like autofs 5.0.4.
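
To make the comparison concrete, here is a small sketch that pulls the number following "rc2" out of a package string such as the ones quoted in this thread and checks it against 125. The function names and the regular expression are assumptions made for illustration; the sketch does not try to judge other upstream versions such as 5.0.4, for which it returns None.

```python
import re

FIXED_REVISION = 125  # from "autofs-5.0.1-0.rc2.125" in the Red Hat bug


def autofs_rc_revision(package):
    """Return the integer after '.rcN.' in a package string, or None.

    Hypothetical helper; it only understands names shaped like
    'autofs-5.0.1-0.rc2.102.el5_3.1'.
    """
    match = re.search(r"\.rc\d+\.(\d+)", package)
    return int(match.group(1)) if match else None


def has_fix(package):
    """True/False when the revision can be read, None when it cannot."""
    rev = autofs_rc_revision(package)
    if rev is None:
        return None
    return rev >= FIXED_REVISION


if __name__ == "__main__":
    for name in ("autofs-5.0.1-0.rc2.102.el5_3.1", "autofs-5.0.1-0.rc2.125"):
        print(name, has_fix(name))
```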

You could always build autofs yourself right from the source of the 
bits, instead of waiting around for Centos:
   http://ftp.kernel.org/pub/linux/daemons/autofs/v5/

> I am probably going to suggest that we stop using automount
> and just mount the NFS partitions directly.
>

Yeah, I guess that would work as well.  :)

regards,
Todd

-- 
Todd Tannenbaum                       University of Wisconsin-Madison
Condor Project Research               Department of Computer Sciences
tannenba__AT__cs.wisc.edu                  1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                 Madison, WI 53706-1685


===========================================================================
Date mail was appended: Tue Aug 18 16:12:07 2009 (1250629927)
From: Carsten Aulbert <carsten.aulbert__AT__aei.mpg.de>
To: condorligo__AT__aei.mpg.de
Subject: Re: [CondorLIGO] Re: [condor-admin #19609] LIGO: mangled paths
 with autofs and condor
Date: Wed, 19 Aug 2009 09:14:09 +0200
CC: Todd Tannenbaum <tannenba__AT__cs.wisc.edu>,        Scott Koranda
 <skoranda__AT__gravity.phys.uwm.edu>,
 "condor-admin response tracking system" <condor-admin__AT__cs.wisc.edu>
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu

Hi Todd,

On Tuesday 18 August 2009 23:10:43 Todd Tannenbaum wrote:
>
> The above is rev 102 of autofs.  You need rev 125 or above, or you could
> use something newer like autofs 5.0.4.

rev \d+ looks like subversion revisions to me, although autofs5 is hosted with 
git AFAIK.

Can you point me please to the patch (or part of the patch) which fixes the 
observed problem? I would like to see if the Debian package of autofs5 5.0.3 
contains this patch or not.

Cheers

Carsten

===========================================================================
Date mail was appended: Wed Aug 19  2:14:22 2009 (1250666063)
Date: Wed, 19 Aug 2009 11:58:41 -0500
From: Greg Thain <gthain__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19609] LIGO: mangled paths with autofs and condor

condor-admin response tracking system wrote:
> Hi Todd,
>
> On Tuesday 18 August 2009 23:10:43 Todd Tannenbaum wrote:
>   
>> The above is rev 102 of autofs.  You need rev 125 or above, or you could
>> use something newer like autofs 5.0.4.
>>     
>
> rev \d+ looks like subversion revisions to me, although autofs5 is hosted with 
> git AFAIK.
>   

I'm not sure how RPMs map onto Debian releases, but the fix is in 
autofs-5.0.1-0.rc2.125.  I assume you are running Debian stable, but I 
can't figure out if that patch is in stable.

-Greg



===========================================================================
Date mail was appended: Wed Aug 19 11:58:46 2009 (1250701126)
Subject: Actions

Ticket resolved by gthain
===========================================================================
Date of actions: Thu Aug 20 13:45:16 2009 (1250793916)
Subject: Actions

Ticket was reopened by mailnull
===========================================================================
Date of actions: Mon Aug 24 10:11:46 2009 (1251126706)
CC: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>,
 Condor/LIGO mailing list <condorligo__AT__aei.mpg.de>,        Stuart Anderson
 <anderson__AT__ligo.caltech.edu>,        Ross Oldenburg
 <rosso__AT__gravity.phys.uwm.edu>,        Carsten Aulbert
 <carsten.aulbert__AT__aei.mpg.de>
From: Duncan Brown <dabrown__AT__physics.syr.edu>
To: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
Subject: Re: [condor-admin #19609] LIGO: mangled paths with autofs and condor
Date: Mon, 24 Aug 2009 04:28:21 -0700
X-Proofpoint-Virus-Version: vendor=fsecure engine=1.12.8161:2.4.5,1.2.40,4.0.166
 definitions=2009-08-24_06:2009-08-11,2009-08-24,2009-08-24 signatures=0
X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 spamscore=0
 ipscore=0 phishscore=0 bulkscore=0 adultscore=0 classifier=spam adjust=0
 reason=mlx engine=5.0.0-0907200000 definitions=main-0908240082
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu

Hi Scott,

On Aug 18, 2009, at 1:22 PM, Scott Koranda wrote:
> I am probably going to suggest that we stop using automount
> and just mount the NFS partitions directly.

I think this is a good idea. For user home directories, automount  
seems to cause more problems than it solves.

Cheers,
Duncan.

-- 

Duncan Brown                          Room 263-1, Department of Physics,
Assistant Professor of Physics        Syracuse University, NY 13244, USA
Phone: (315) 443 5993             http://www.gravity.phy.syr.edu/~duncan




===========================================================================
Date mail was appended: Mon Aug 24 10:11:46 2009 (1251126706)
Date: Mon, 24 Aug 2009 10:20:24 -0500
From: Greg Thain <gthain__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19609] LIGO: mangled paths with autofs and condor

condor-admin response tracking system wrote:
> Hi Scott,
>
> On Aug 18, 2009, at 1:22 PM, Scott Koranda wrote:
>   
>> I am probably going to suggest that we stop using automount
>> and just mount the NFS partitions directly.

LIGO-ers:

Do y'all need any more help from us, or can I close this ticket?  As you 
know, we're very keen on seeing you succeed, so even if it isn't a 
specific Condor problem, we're willing to help out as much as we can.

-Greg

===========================================================================
Date mail was appended: Mon Aug 24 10:20:30 2009 (1251127231)
Subject: Actions

Ticket resolved by gthain
===========================================================================
Date of actions: Mon Aug 24 10:22:01 2009 (1251127321)
Subject: Actions

Ticket was reopened by mailnull
===========================================================================
Date of actions: Mon Aug 24 11:05:39 2009 (1251129940)
Date: Mon, 24 Aug 2009 11:05:30 -0500
From: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: condorligo__AT__aei.mpg.de, anderson__AT__ligo.caltech.edu,
 dabrown__AT__physics.syr.edu,        rosso__AT__gravity.phys.uwm.edu,
 carsten.aulbert__AT__aei.mpg.de
Subject: Re: [condor-admin #19609] LIGO: mangled paths with autofs and condor
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu

Hi,

> condor-admin response tracking system wrote:
> > Hi Scott,
> >
> > On Aug 18, 2009, at 1:22 PM, Scott Koranda wrote:
> >   
> >> I am probably going to suggest that we stop using automount
> >> and just mount the NFS partitions directly.
> LIGO-ers:
> 
> Do y'all need any more help from us, or can I close this ticket?  

I think you should close the ticket, but I defer to Stuart...

> As you 
> know, we're very keen on seeing you succeed, so even if it isn't a 
> specific Condor problem, we're willing to help out as much as we can.

Thanks much. I appreciate your time.

Cheers,

Scott


===========================================================================
Date mail was appended: Mon Aug 24 11:05:39 2009 (1251129940)
CC: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>,
 condorligo__AT__aei.mpg.de, dabrown__AT__physics.syr.edu, rosso__AT__gravity.phys.uwm.edu,
 carsten.aulbert__AT__aei.mpg.de
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
Subject: Re: [condor-admin #19609] LIGO: mangled paths with autofs and condor
Date: Mon, 24 Aug 2009 09:23:48 -0700
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu


On Aug 24, 2009, at 9:05 AM, Scott Koranda wrote:

> Hi,
>
>> condor-admin response tracking system wrote:
>>> Hi Scott,
>>>
>>> On Aug 18, 2009, at 1:22 PM, Scott Koranda wrote:
>>>
>>>> I am probably going to suggest that we stop using automount
>>>> and just mount the NFS partitions directly.
>> LIGO-ers:
>>
>> Do y'all need any more help from us, or can I close this ticket?
>
> I think you should close the ticket, but I defer to Stuart...

From what I understand this ticket should be closed. The Condor team  
support has been very helpful, but now that the problem is isolated to  
Linux automount issues the ball is strictly in LIGO's court.

Thanks.


--
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson




===========================================================================
Date mail was appended: Mon Aug 24 11:24:03 2009 (1251131044)
Subject: Actions

Ticket resolved by gthain
===========================================================================
Date of actions: Mon Aug 24 17:15:22 2009 (1251152122)