LIGO Support Ticket 19609
Ticket Information
Number: admin 19609
User: skoranda@gravity.phys.uwm.edu
Email: condorligo__AT__aei.mpg.de,anderson__AT__ligo.caltech.edu,dabrown__AT__physics.syr.edu,rosso__AT__gravity.phys.uwm.edu,carsten.aulbert__AT__aei.mpg.de
Status: resolved
Assigned To: gthain
Date: Tue, 18 Aug 2009 10:24:58 -0500
From: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
To: condor-admin__AT__cs.wisc.edu
CC: condorligo__AT__aei.mpg.de, Stuart Anderson <anderson__AT__ligo.caltech.edu>,
Duncan Brown <dabrown__AT__physics.syr.edu>, Ross Oldenburg
<rosso__AT__gravity.phys.uwm.edu>
Subject: LIGO: mangled paths with autofs and condor
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu
Hi,
I am forwarding this to condor-admin with the subject line
"LIGO" to officially create a LIGO support ticket.
I am also asking officially that this become the highest
priority LIGO ticket at this time. It is seriously impacting a
high-profile science analysis, and users are leaving our
cluster to compute elsewhere.
I would appreciate any effort the Condor team can put into it.
Please let us know what you need from us.
Thanks,
Scott
----- Forwarded message from Ross Oldenburg <rosso__AT__gravity.phys.uwm.edu> -----
Date: Thu, 13 Aug 2009 16:43:43 -0500
From: Ross Oldenburg <rosso__AT__gravity.phys.uwm.edu>
To: ligosysadm__AT__gravity.psu.edu
Cc: condor-users-request__AT__cs.wisc.edu, condorligo__AT__aei.mpg.de
Subject: [CondorLIGO] mangled paths with autofs and condor
Hello,
We have been seeing a strange problem for the past several weeks that
looks like an issue with the Linux NFS client, especially with respect to
automounted filesystems. In summary, if an NFS server crashes or autofs
hiccups or restarts, Condor mangles absolute paths, dropping the mount
point from the head of the path. For example, if my Condor job is working
out of /mnt/nfs1/data/somejob and nfs1 crashes or autofs is restarted on
the execute machine, then instead of recovering or failing cleanly,
Condor starts to think it is working in /data/somejob, dropping /mnt/nfs1
entirely. Examples from DAGMan logfiles follow:
Working directory:
/home/jclayton/S6/weeklyruns/lowmass/highthreshweek1/931035296-931564887/full_data
dagman.out snippet:
...
8/7 21:38:41 Submitting Condor Node b12e16d65e95e43d0b5e8a63b79fb222
job(s)...
8/7 21:38:41 submitting: condor_submit -a dag_node_name' '='
'b12e16d65e95e43d0b5e8a63b79fb222 -a +DAGManJobId' '=' '2748980 -a
DAGManJobId' '=' '2748980 -a submit_event_notes' '=' 'DAG' 'Node:'
'b12e16d65e95e43d0b5e8a63b79fb222 -a macronumslides' '=' '50 -a
macroh1triggers' '=' ' -a macroifotag' '=' 'SECOND_H1V1 -a
macrogpsstarttime' '=' '931174528 -a macrogpsendtime' '=' '931175496 -a
macrov1triggers' '=' ' -a macroarguments' '='
'H1-INSPIRAL_SECOND_H1V1_FULL_DATA-931174464-2048.xml.gz'
'H1-INSPIRAL_SECOND_H1V1_FULL_DATA-931174952-2048.xml.gz'
'V1-INSPIRAL_SECOND_H1V1_FULL_DATA-931173977-2048.xml.gz -a
+DAGParentNodeNames' '='
'"4579823e472622649f77bc6d00289d48,34623f843b05aa5b464047e38216a7e4,0f8342bf051b45f35ac57dd586d4cabf"
inspiral_hipe_full_data.thinca2_slides_H1V1.FULL_DATA.sub
8/7 21:38:41 From submit: Submitting job(s)
8/7 21:38:41 From submit: ERROR: No such directory:
/S6/weeklyruns/lowmass/highthreshweek1/931035296-931564887/full_data
8/7 21:38:41 failed while reading from pipe.
8/7 21:38:41 Read so far: Submitting job(s)ERROR: No such directory:
/S6/weeklyruns/lowmass/highthreshweek1/931035296-931564887/full_data
8/7 21:38:41 ERROR: submit attempt failed
8/7 21:38:41 submit command was: condor_submit -a dag_node_name' '='
'b12e16d65e95e43d0b5e8a63b79fb222 -a +DAGManJobId' '=' '2748980 -a
DAGManJobId' '=' '2748980 -a submit_event_notes' '=' 'DAG' 'Node:'
'b12e16d65e95e43d0b5e8a63b79fb222 -a macronumslides' '=' '50 -a
macroh1triggers' '=' ' -a macroifotag' '=' 'SECOND_H1V1 -a
macrogpsstarttime' '=' '931174528 -a macrogpsendtime' '=' '931175496 -a
macrov1triggers' '=' ' -a macroarguments' '='
'H1-INSPIRAL_SECOND_H1V1_FULL_DATA-931174464-2048.xml.gz'
'H1-INSPIRAL_SECOND_H1V1_FULL_DATA-931174952-2048.xml.gz'
'V1-INSPIRAL_SECOND_H1V1_FULL_DATA-931173977-2048.xml.gz -a
+DAGParentNodeNames' '='
'"4579823e472622649f77bc6d00289d48,34623f843b05aa5b464047e38216a7e4,0f8342bf051b45f35ac57dd586d4cabf"
inspiral_hipe_full_data.thinca2_slides_H1V1.FULL_DATA.sub
8/7 21:38:41 Could not change to original directory: Unable to chdir to
S6/weeklyruns/lowmass/highthreshweek1/931035296-931564887/full_data: No
such file or directory
8/7 21:38:41 Job submit try 1/6 failed, will try again in >= 1 second.
...
This continues (six tries) until the node fails.
Working directory:
/home/larry/S6_weekly/lowmass/week3/932169543-932774487/bbhinj
dagman.out snippet:
...
8/7 10:28:58 Submitting Condor Node 27d612a286265b135f502bb68ae6628b
job(s)...
8/7 10:28:58 submitting: condor_submit -a dag_node_name' '='
'27d612a286265b135f502bb68ae6628b -a +DAGManJobId' '=' '6419231 -a
DAGManJobId' '=' '6419231 -a submit_event_notes' '=' 'DAG' 'Node:'
'27d612a286265b135f502bb68ae6628b -a macrov1triggers' '=' ' -a
macrogpsendtime' '=' '932670135 -a macroh1triggers' '=' ' -a
macrogpsstarttime' '=' '932666971 -a macroifotag' '=' 'SECOND_H1V1 -a
macroarguments' '=' 'H1-INSPIRAL_SECOND_H1V1_BBHINJ-932666907-2048.xml.gz'
'H1-INSPIRAL_SECOND_H1V1_BBHINJ-932668827-2048.xml.gz'
'V1-INSPIRAL_SECOND_H1V1_BBHINJ-932665260-2048.xml.gz'
'V1-INSPIRAL_SECOND_H1V1_BBHINJ-932667180-2048.xml.gz'
'V1-INSPIRAL_SECOND_H1V1_BBHINJ-932669100-2048.xml.gz -a
+DAGParentNodeNames' '='
'"bbbce8b5146116c109b02fc767948090,5f02925d7627146ec0d429a96ec52a92,e973da9d3acda11972dfb0c81980e1e9,5ee70c2feb48e2083aa387302f7e8357,f568cd86fe7e641bd640abdb41649e32"
inspiral_hipe_bbhinj.thinca2_H1V1.BBHINJ.sub
8/7 10:28:58 From submit: Submitting job(s)
8/7 10:28:58 From submit: ERROR: No such directory:
/S6_weekly/lowmass/week3/932169543-932774487/bbhinj
8/7 10:28:58 failed while reading from pipe.
8/7 10:28:58 Read so far: Submitting job(s)ERROR: No such directory:
/S6_weekly/lowmass/week3/932169543-932774487/bbhinj
8/7 10:28:58 ERROR: submit attempt failed
8/7 10:28:58 submit command was: condor_submit -a dag_node_name' '='
'27d612a286265b135f502bb68ae6628b -a +DAGManJobId' '=' '6419231 -a
DAGManJobId' '=' '6419231 -a submit_event_notes' '=' 'DAG' 'Node:'
'27d612a286265b135f502bb68ae6628b -a macrov1triggers' '=' ' -a
macrogpsendtime' '=' '932670135 -a macroh1triggers' '=' ' -a
macrogpsstarttime' '=' '932666971 -a macroifotag' '=' 'SECOND_H1V1 -a
macroarguments' '=' 'H1-INSPIRAL_SECOND_H1V1_BBHINJ-932666907-2048.xml.gz'
'H1-INSPIRAL_SECOND_H1V1_BBHINJ-932668827-2048.xml.gz'
'V1-INSPIRAL_SECOND_H1V1_BBHINJ-932665260-2048.xml.gz'
'V1-INSPIRAL_SECOND_H1V1_BBHINJ-932667180-2048.xml.gz'
'V1-INSPIRAL_SECOND_H1V1_BBHINJ-932669100-2048.xml.gz -a
+DAGParentNodeNames' '='
'"bbbce8b5146116c109b02fc767948090,5f02925d7627146ec0d429a96ec52a92,e973da9d3acda11972dfb0c81980e1e9,5ee70c2feb48e2083aa387302f7e8357,f568cd86fe7e641bd640abdb41649e32"
inspiral_hipe_bbhinj.thinca2_H1V1.BBHINJ.sub
8/7 10:28:58 Could not change to original directory: Unable to chdir to
S6_weekly/lowmass/week3/932169543-932774487/bbhinj: No such file or
directory
8/7 10:28:58 Job submit try 1/6 failed, will try again in >= 1 second.
...
Again, Condor retries six times and fails the job.
These single job failures cause the entire DAG to bail out and write a
rescue file. When submitted, the rescue DAG works normally provided there
are no more NFS hiccups.
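The symptom in both excerpts can be checked mechanically. The following is only a minimal sketch, not anything Condor or DAGMan provides: the helper name dropped_prefix is hypothetical, and it simply compares the expected working directory with the path reported in the "No such directory" error to see which leading components went missing.

```python
def dropped_prefix(expected_dir, reported_dir):
    """Return the leading part of expected_dir that is missing from
    reported_dir, or None if reported_dir is not a shorter tail of it."""
    expected = [part for part in expected_dir.split("/") if part]
    reported = [part for part in reported_dir.split("/") if part]
    # The mangled path must be strictly shorter than the real one.
    if not reported or len(reported) >= len(expected):
        return None
    # The mangled path must match the tail of the real one exactly.
    if expected[-len(reported):] != reported:
        return None
    missing = expected[:len(expected) - len(reported)]
    return "/" + "/".join(missing)


# Paths copied from the two dagman.out excerpts above.
examples = [
    ("/home/jclayton/S6/weeklyruns/lowmass/highthreshweek1/931035296-931564887/full_data",
     "/S6/weeklyruns/lowmass/highthreshweek1/931035296-931564887/full_data"),
    ("/home/larry/S6_weekly/lowmass/week3/932169543-932774487/bbhinj",
     "/S6_weekly/lowmass/week3/932169543-932774487/bbhinj"),
]
for expected, reported in examples:
    print(dropped_prefix(expected, reported))
```

For the two cases above, the missing heads are /home/jclayton and /home/larry, i.e. the first two components of each working directory.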
The reason I think this is an issue with the Linux NFS client is that we
did not see this problem until the last kernel upgrade. We're running
CentOS 5.3 (only tracking security updates) and Condor 7.2.4. Here is
some more detailed information:
[root@marlin ~]# condor_version
$CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $
$CondorPlatform: X86_64-LINUX_RHEL5 $
[root@marlin ~]# condor_dagman -v
$CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $
$CondorPlatform: X86_64-LINUX_RHEL5 $
[root@marlin ~]# uname -a
Linux marlin.phys.uwm.edu 2.6.18-128.1.14.el5 #1 SMP Wed Jun 17 06:38:05
EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
[root@marlin ~]# rpm -q glibc
glibc-2.5-24 [x86_64]
glibc-2.5-34 [x86_64]
glibc-2.5-34 [i386]
Has anybody else seen this problem or something similar? If so, what is
the workaround? Is there more information I can provide to help track
down this problem, or somewhere else I should look?
Thanks for your assistance,
Ross Oldenburg
UWMLSC Sysadmin
_______________________________________________
Condorligo mailing list
Condorligo__AT__aei.mpg.de
http://lists.aei.mpg.de/cgi-bin/mailman/listinfo/condorligo
----- End forwarded message -----
===========================================================================
Date of creation: Tue Aug 18 10:25:08 2009 (1250609110)
Subject: Actions
Assigned to gthain by gthain
===========================================================================
Date of actions: Tue Aug 18 12:11:29 2009 (1250615489)
Date: Tue, 18 Aug 2009 12:14:02 -0500
From: Greg Thain <gthain__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19609] LIGO: mangled paths with autofs and condor
Scott:
Have you tried the "initialdir" change that Todd suggested last week?
Also, if we could see a submit file, that would help. How often does
this happen?
Thanks,
-Greg
===========================================================================
Date mail was appended: Tue Aug 18 12:14:08 2009 (1250615649)
Date: Tue, 18 Aug 2009 12:19:08 -0500
From: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: condorligo__AT__aei.mpg.de, anderson__AT__ligo.caltech.edu,
dabrown__AT__physics.syr.edu, rosso__AT__gravity.phys.uwm.edu
Subject: Re: [condor-admin #19609] LIGO: mangled paths with autofs and condor
X-Seen-BY: mailfromd 4.1 granite.cs.wisc.edu
> Scott:
>
> Have you tried the "initialdir" change that Todd suggested last week?
No. Can somebody give me a quick summary so that I can get the
info into the hands of the users?
> Also, if we could see a submit file, that would help.
I will get one to you right after the scientists come back
from lunch.
> How often does
> this happen?
It does not happen with every job, or every DAG, but it is
repeatable enough that users feel they cannot make progress
with work.
Thanks,
Scott
===========================================================================
Date mail was appended: Tue Aug 18 12:19:18 2009 (1250615959)
Date: Tue, 18 Aug 2009 12:51:09 -0500
From: Ross Oldenburg <rosso__AT__gravity.phys.uwm.edu>
To: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
CC: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>,
condorligo__AT__aei.mpg.de, anderson__AT__ligo.caltech.edu, dabrown__AT__physics.syr.edu
Subject: Re: [condor-admin #19609] LIGO: mangled paths with autofs and condor
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu
Hi Scott, Greg,
Scott Koranda wrote:
>> Scott:
>>
>> Have you tried the "initialdir" change that Todd suggested last week?
>>
>
> No. Can somebody give me a quick summary so that I can get the
> info into the hands of the users?
>
Todd recommended explicitly specifying
initialdir='/this/is/my/job/directory' instead of relying on the current
working directory when generating submit files. If autofs (or NFS in
general) hiccups, Condor will see a NULL for the automounted part of Cwd
and will chop off the front of the path. The theory is that autofs
mount points are actually symlinks to something out in kernel land.
Condor attempts to resolve all symlinks into absolute paths. So if
condor_submit is attempting to resolve a path while autofs is being
reloaded/restarted or an NFS server crashes (and hence can't be
automounted), we end up with NULL as a mount point and get
'/is/my/job/directory' instead of '/this/is/my/job/directory'. I hope
this helps clarify what we think is happening.
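To make the suggestion concrete for whoever generates the submit files, here is a minimal sketch, assuming the generator already knows the job directory as an absolute path when it writes the file. The helper name with_initialdir and the example submit text are hypothetical; initialdir itself is the condor_submit command Todd referred to, written in a submit file as "initialdir = <path>". Whether this avoids the failure has not been confirmed in this thread.

```python
import posixpath


def with_initialdir(submit_text, job_dir):
    """Return submit-file text with an explicit 'initialdir = job_dir' line
    placed before the first queue statement (hypothetical helper)."""
    if not posixpath.isabs(job_dir):
        raise ValueError("job_dir must be an absolute path: %r" % job_dir)
    out = []
    inserted = False
    for line in submit_text.splitlines():
        key = line.split("=", 1)[0].strip().lower()
        if "=" in line and key == "initialdir":
            continue  # drop any existing setting; it is replaced below
        if not inserted and line.strip().lower().split()[:1] == ["queue"]:
            out.append("initialdir = " + job_dir)
            inserted = True
        out.append(line)
    if not inserted:
        # No queue statement found: append the setting at the end.
        out.append("initialdir = " + job_dir)
    return "\n".join(out) + "\n"


# Hypothetical submit description, for illustration only.
example = "universe = vanilla\nexecutable = /bin/true\nqueue 1\n"
print(with_initialdir(example, "/this/is/my/job/directory"))
```

The point of the design is that the path is fixed when the file is generated, so nothing has to look up the current working directory at submit time.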
>
>> Also, if we could see a submit file, that would help.
>>
>
> I will get one to you right after the scientists come back
> from lunch.
>
>
>> How often does
>> this happen?
>>
>
> It does not happen with every job, or every DAG, but it is
> repeatable enough that users feel they cannot make progress
> with work.
>
It happens whenever any NFS filesystem that a job depends on gets stuck
or momentarily disappears for some reason (a server crash, the locker
gets screwed up, autofs restart, etc.)
If there is any more information I can provide or any suggestions for
tests I can run, please let me know.
--Ross
> Thanks,
>
> Scott
>
>
>> Thanks,
>>
>> -Greg
>>
===========================================================================
Date mail was appended: Tue Aug 18 12:51:15 2009 (1250617876)
Date: Tue, 18 Aug 2009 13:30:24 -0500
From: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: condorligo__AT__aei.mpg.de, anderson__AT__ligo.caltech.edu,
dabrown__AT__physics.syr.edu, rosso__AT__gravity.phys.uwm.edu
Subject: Re: [condor-admin #19609] LIGO: mangled paths with autofs and condor
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu
Hi,
> Scott:
>
> Have you tried the "initialdir" change that Todd suggested last week?
> Also, if we could see a submit file, that would help.
Please download
http://www.lsc-group.phys.uwm.edu/~skoranda/LIGO-Condor-ticket-19609-01.tar.gz
In that tarball you will find
inspiral_hipe_full_data.FULL_DATA.dag.dagman.out
That DAGman output file shows the errors, for example
8/7 21:38:41 Could not change to original directory: Unable to chdir to S6/weeklyruns/lowmass/highthreshweek1/931035296-931564887/full_data: No such file or directory
Also in that tarball you will find all 89 submit files for
that DAG.
Please let me know if you need anything else.
Thanks,
Scott
===========================================================================
Date mail was appended: Tue Aug 18 13:30:35 2009 (1250620236)
Date: Tue, 18 Aug 2009 15:02:37 -0500
From: Todd Tannenbaum <tannenba__AT__cs.wisc.edu>
To: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
CC: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>,
condorligo__AT__aei.mpg.de, dabrown__AT__physics.syr.edu
Subject: Re: [CondorLIGO] Re: [condor-admin #19609] LIGO: mangled paths
with autofs and condor
Scott Koranda wrote:
> Hi,
>
>> Scott:
>>
>> Have you tried the "initialdir" change that Todd suggested last week?
I still think the initialdir suggestion will help.
But realize this is really just working around a Linux bug. Seems like
Red Hat has seen this exact same problem and supposedly has a fix.
Take a look at
https://bugzilla.redhat.com/show_bug.cgi?id=452122
Deja vu, eh? :)
The last two comments are :
"This issue has been fixed in the latest autofs package
autofs-5.0.1-0.rc2.125."
and
"RHEL 5.4 Beta has been released! There should be a fix present in the
Beta release that addresses this particular request."
regards,
Todd
--
Todd Tannenbaum University of Wisconsin-Madison
Condor Project Research Department of Computer Sciences
tannenba__AT__cs.wisc.edu 1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132 Madison, WI 53706-1685
===========================================================================
Date mail was appended: Tue Aug 18 15:02:55 2009 (1250625776)
Date: Tue, 18 Aug 2009 15:07:16 -0500
From: Greg Thain <gthain__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19609] LIGO: mangled paths with autofs and condor
condor-admin response tracking system wrote:
> Hi,
>
>
>> Scott:
>>
>> Have you tried the "initialdir" change that Todd suggested last week?
>> Also, if we could see a submit file, that would help.
>>
>
> Please download
>
> http://www.lsc-group.phys.uwm.edu/~skoranda/LIGO-Condor-ticket-19609-01.tar.gz
>
Thanks, Scott, this was really helpful. When condor_submit runs, if
initialdir isn't set in the submit file, it calls
getcwd to populate the IWD. This getcwd seems to be failing because of
automount issues (see https://bugzilla.redhat.com/show_bug.cgi?id=452122).
If the initialdir option is set in the Condor submit file, then Condor
doesn't run the getcwd. This would be the best way to work around the
problem.
-Greg
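A minimal sketch of the workaround described above, not part of the
original exchange: the helper below adds an initialdir line to the text
of a submit file that lacks one, placing it before the first queue
statement. The function name add_initialdir and the example paths are
hypothetical; the check only recognises the spelling "initialdir" and
assumes one submit command per line.

```python
import re


def add_initialdir(submit_text: str, directory: str) -> str:
    """Return submit_text with 'initialdir = directory' added if it is missing.

    Hypothetical helper for illustration only. The new line goes just before
    the first queue statement, or at the end if no queue statement is found.
    """
    lines = submit_text.splitlines()

    # Leave the file untouched if initialdir is already set (any letter case).
    if any(re.match(r"\s*initialdir\s*=", ln, re.IGNORECASE) for ln in lines):
        return submit_text

    out = []
    inserted = False
    for ln in lines:
        # Insert once, immediately before the first queue statement.
        if not inserted and re.match(r"\s*queue\b", ln, re.IGNORECASE):
            out.append(f"initialdir = {directory}")
            inserted = True
        out.append(ln)

    if not inserted:
        out.append(f"initialdir = {directory}")

    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Example paths are made up for the demonstration.
    example = "universe = vanilla\nexecutable = /bin/true\nqueue 1\n"
    print(add_initialdir(example, "/home/user/run"), end="")
```

If the submit files come from a generator script, a pass like this could
be run over the generated files with the DAG's working directory as the
second argument, so that, per the explanation above, condor_submit does
not need to call getcwd.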
===========================================================================
Date mail was appended: Tue Aug 18 15:07:21 2009 (1250626041)
Date: Tue, 18 Aug 2009 15:22:48 -0500
From: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: condorligo__AT__aei.mpg.de, anderson__AT__ligo.caltech.edu,
dabrown__AT__physics.syr.edu, rosso__AT__gravity.phys.uwm.edu
Subject: Re: [condor-admin #19609] LIGO: mangled paths with autofs and condor
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu
> Scott Koranda wrote:
> > Hi,
> >
> >> Scott:
> >>
> >> Have you tried the "initialdir" change that Todd suggested last week?
>
> I still think the initialdir suggestion will help.
Thanks. Unfortunately it is not simple to change the scripts
that generate the DAG/submit files to add an appropriate value
for 'initialdir'.
>
> But realize this is really just working around a Linux bug. Seems like
> Red Hat has seen this exact same problem and supposedly has a fix.
>
> Take a look at
> https://bugzilla.redhat.com/show_bug.cgi?id=452122
>
> Deja vu, eh? :)
>
> The last two comments are :
>
> "This issue has been fixed in the latest autofs package
> autofs-5.0.1-0.rc2.125."
>
> and
>
> "RHEL 5.4 Beta has been released! There should be a fix present in the
> Beta release that addresses this particular request."
>
> regards,
> Todd
>
Looking at the latest CentOS SRPM at
http://mirror.anl.gov/pub/centos/5.3/updates/SRPMS/autofs-5.0.1-0.rc2.102.el5_3.1.src.rpm
I don't see any comment about that bug. So I don't think we
have access to the supposed fix yet.
I am probably going to suggest that we stop using automount
and just mount the NFS partitions directly.
Thanks,
Scott
===========================================================================
Date mail was appended: Tue Aug 18 15:22:56 2009 (1250626977)
Subject: Actions
Ticket resolved by gthain
===========================================================================
Date of actions: Tue Aug 18 15:28:55 2009 (1250627335)
Subject: Actions
Ticket was reopened by mailnull
===========================================================================
Date of actions: Tue Aug 18 16:12:07 2009 (1250629927)
Date: Tue, 18 Aug 2009 16:10:43 -0500
From: Todd Tannenbaum <tannenba__AT__cs.wisc.edu>
To: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
CC: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>,
condorligo__AT__aei.mpg.de, dabrown__AT__physics.syr.edu
Subject: Re: [CondorLIGO] Re: [condor-admin #19609] LIGO: mangled paths
with autofs and condor
Scott Koranda wrote:
> Looking at the latest CentOS SRPM at
>
> http://mirror.anl.gov/pub/centos/5.3/updates/SRPMS/autofs-5.0.1-0.rc2.102.el5_3.1.src.rpm
>
> I don't see any comment about that bug. So I don't think we
> have access to the supposed fix yet.
>
The above is rev 102 of autofs. You need rev 125 or above, or you could
use something newer like autofs 5.0.4.
You could always build autofs yourself directly from the upstream
source, instead of waiting around for CentOS:
http://ftp.kernel.org/pub/linux/daemons/autofs/v5/
> I am probably going to suggest that we stop using automount
> and just mount the NFS partitions directly.
>
Yeah, I guess that would work as well. :)
regards,
Todd
--
Todd Tannenbaum University of Wisconsin-Madison
Condor Project Research Department of Computer Sciences
tannenba__AT__cs.wisc.edu 1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132 Madison, WI 53706-1685
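As a rough illustration of the version comparison above, and assuming
the package string follows the autofs-5.0.1-0.rc2.N pattern seen in this
thread, the sketch below extracts N and compares it with 125. The
function names are hypothetical, and the answer is only as reliable as
the Bugzilla comment quoted earlier; for any other version string the
helper returns None rather than guess.

```python
import re
from typing import Optional

# Revision named in the Red Hat Bugzilla comment quoted in this thread.
FIXED_REVISION = 125


def autofs_rc2_revision(package: str) -> Optional[int]:
    """Extract N from a string like 'autofs-5.0.1-0.rc2.N...', else None."""
    m = re.match(r"autofs-5\.0\.1-0\.rc2\.(\d+)", package)
    return int(m.group(1)) if m else None


def has_reported_fix(package: str) -> Optional[bool]:
    """True/False for the 5.0.1-0.rc2.N series; None for anything else."""
    rev = autofs_rc2_revision(package)
    if rev is None:
        return None
    return rev >= FIXED_REVISION


if __name__ == "__main__":
    for pkg in ("autofs-5.0.1-0.rc2.102.el5_3.1", "autofs-5.0.1-0.rc2.125"):
        print(pkg, has_reported_fix(pkg))
```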
===========================================================================
Date mail was appended: Tue Aug 18 16:12:07 2009 (1250629927)
From: Carsten Aulbert <carsten.aulbert__AT__aei.mpg.de>
To: condorligo__AT__aei.mpg.de
Subject: Re: [CondorLIGO] Re: [condor-admin #19609] LIGO: mangled paths
with autofs and condor
Date: Wed, 19 Aug 2009 09:14:09 +0200
CC: Todd Tannenbaum <tannenba__AT__cs.wisc.edu>, Scott Koranda
<skoranda__AT__gravity.phys.uwm.edu>,
"condor-admin response tracking system" <condor-admin__AT__cs.wisc.edu>
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu
Hi Todd,
On Tuesday 18 August 2009 23:10:43 Todd Tannenbaum wrote:
>
> The above is rev 102 of autofs. You need rev 125 or above, or you could
> use something newer like autofs 5.0.4.
rev \d+ looks like subversion revisions to me, although autofs5 is hosted with
git AFAIK.
Can you point me please to the patch (or part of the patch) which fixes the
observed problem? I would like to see if the Debian package of autofs5 5.0.3
contains this patch or not.
Cheers
Carsten
===========================================================================
Date mail was appended: Wed Aug 19 2:14:22 2009 (1250666063)
Date: Wed, 19 Aug 2009 11:58:41 -0500
From: Greg Thain <gthain__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19609] LIGO: mangled paths with autofs and condor
condor-admin response tracking system wrote:
> Hi Todd,
>
> On Tuesday 18 August 2009 23:10:43 Todd Tannenbaum wrote:
>
>> The above is rev 102 of autofs. You need rev 125 or above, or you could
>> use something newer like autofs 5.0.4.
>>
>
> rev \d+ looks like subversion revisions to me, although autofs5 is hosted with
> git AFAIK.
>
I'm not sure how rpms map into debian releases, but the fix is in
autofs-5.0.1-0.rc2.125. I assume you are running Debian stable, but I
can't figure out if that patch is in stable.
-Greg
===========================================================================
Date mail was appended: Wed Aug 19 11:58:46 2009 (1250701126)
Subject: Actions
Ticket resolved by gthain
===========================================================================
Date of actions: Thu Aug 20 13:45:16 2009 (1250793916)
Subject: Actions
Ticket was reopened by mailnull
===========================================================================
Date of actions: Mon Aug 24 10:11:46 2009 (1251126706)
CC: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>,
Condor/LIGO mailing list <condorligo__AT__aei.mpg.de>, Stuart Anderson
<anderson__AT__ligo.caltech.edu>, Ross Oldenburg
<rosso__AT__gravity.phys.uwm.edu>, Carsten Aulbert
<carsten.aulbert__AT__aei.mpg.de>
From: Duncan Brown <dabrown__AT__physics.syr.edu>
To: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
Subject: Re: [condor-admin #19609] LIGO: mangled paths with autofs and condor
Date: Mon, 24 Aug 2009 04:28:21 -0700
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu
Hi Scott,
On Aug 18, 2009, at 1:22 PM, Scott Koranda wrote:
> I am probably going to suggest that we stop using automount
> and just mount the NFS partitions directly.
I think this is a good idea. For user home directories, automount
seems to cause more problems than it solves.
Cheers,
Duncan.
--
Duncan Brown Room 263-1, Department of Physics,
Assistant Professor of Physics Syracuse University, NY 13244, USA
Phone: (315) 443 5993 http://www.gravity.phy.syr.edu/~duncan
===========================================================================
Date mail was appended: Mon Aug 24 10:11:46 2009 (1251126706)
Date: Mon, 24 Aug 2009 10:20:24 -0500
From: Greg Thain <gthain__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19609] LIGO: mangled paths with autofs and condor
condor-admin response tracking system wrote:
> Hi Scott,
>
> On Aug 18, 2009, at 1:22 PM, Scott Koranda wrote:
>
>> I am probably going to suggest that we stop using automount
>> and just mount the NFS partitions directly.
LIGO-ers:
Do y'all need any more help from us, or can I close this ticket? As you
know, we're very keen on seeing you succeed, so even if it isn't a
specific Condor problem, we're willing to help out as much as we can.
-Greg
===========================================================================
Date mail was appended: Mon Aug 24 10:20:30 2009 (1251127231)
Subject: Actions
Ticket resolved by gthain
===========================================================================
Date of actions: Mon Aug 24 10:22:01 2009 (1251127321)
Subject: Actions
Ticket was reopened by mailnull
===========================================================================
Date of actions: Mon Aug 24 11:05:39 2009 (1251129940)
Date: Mon, 24 Aug 2009 11:05:30 -0500
From: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: condorligo__AT__aei.mpg.de, anderson__AT__ligo.caltech.edu,
dabrown__AT__physics.syr.edu, rosso__AT__gravity.phys.uwm.edu,
carsten.aulbert__AT__aei.mpg.de
Subject: Re: [condor-admin #19609] LIGO: mangled paths with autofs and condor
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu
Hi,
> condor-admin response tracking system wrote:
> > Hi Scott,
> >
> > On Aug 18, 2009, at 1:22 PM, Scott Koranda wrote:
> >
> >> I am probably going to suggest that we stop using automount
> >> and just mount the NFS partitions directly.
> LIGO-ers:
>
> Do y'all need any more help from us, or can I close this ticket?
I think you should close the ticket, but I defer to Stuart...
> As you
> know, we're very keen on seeing you succeed, so even if it isn't a
> specific Condor problem, we're willing to help out as much as we can.
Thanks much. I appreciate your time.
Cheers,
Scott
===========================================================================
Date mail was appended: Mon Aug 24 11:05:39 2009 (1251129940)
CC: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>,
condorligo__AT__aei.mpg.de, dabrown__AT__physics.syr.edu, rosso__AT__gravity.phys.uwm.edu,
carsten.aulbert__AT__aei.mpg.de
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
Subject: Re: [condor-admin #19609] LIGO: mangled paths with autofs and condor
Date: Mon, 24 Aug 2009 09:23:48 -0700
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu
On Aug 24, 2009, at 9:05 AM, Scott Koranda wrote:
> Hi,
>
>> condor-admin response tracking system wrote:
>>> Hi Scott,
>>>
>>> On Aug 18, 2009, at 1:22 PM, Scott Koranda wrote:
>>>
>>>> I am probably going to suggest that we stop using automount
>>>> and just mount the NFS partitions directly.
>> LIGO-ers:
>>
>> Do y'all need any more help from us, or can I close this ticket?
>
> I think you should close the ticket, but I defer to Stuart...
From what I understand this ticket should be closed. The Condor team
support has been very helpful, but now that the problem is isolated to
Linux automount issues the ball is strictly in LIGO's court.
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Mon Aug 24 11:24:03 2009 (1251131044)
Subject: Actions
Ticket resolved by gthain
===========================================================================
Date of actions: Mon Aug 24 17:15:22 2009 (1251152122)