LIGO Support Ticket 17438
Ticket Information
Number: admin 17438
User: anderson@ligo.caltech.edu
Email: skoranda__AT__gravity.phys.uwm.edu,grimaldi__AT__ligo.mit.edu,tannenba__AT__cs.wisc.edu
Status: pending
Assigned To: jfrey
Date: Fri, 1 Feb 2008 16:42:35 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
CC: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>, Filippo
Grimaldi <grimaldi__AT__ligo.mit.edu>
Subject: LIGO: condor_gridmanager coredump
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu
The LIGO Condor pool at Caltech running,
# condor_version
$CondorVersion: 6.9.5 Nov 28 2007 BuildID: 65347 $
$CondorPlatform: X86_64-LINUX_RHEL3 $
had condor_gridmanager coredump,
# ls -l core
-rw-r--r-- 1 root root 1302528 Jan 16 13:26 core
# file core
core: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV), SVR4-style, from 'condor_gridmanag'
# gdb /usr/sbin/condor_gridmanager core
GNU gdb Red Hat Linux (6.3.0.0-1.84rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...(no debugging symbols found)
Using host libthread_db library "/lib64/libthread_db.so.1".
warning: core file may not match specified executable file.
Core was generated by `condor_gridmanager -f -C (Owner=?="grimaldi"&&JobUniverse==9) -o grimaldi -S /tm'.
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libcrypt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libcrypt.so.1
Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libresolv.so.2
Reading symbols from /usr/lib64/libstdc++.so.5...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libstdc++.so.5
Reading symbols from /lib64/libm.so.6...
(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libgcc_s.so.1
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...
(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
#0 0x0000000000571360 in WriteCoreDump ()
(gdb) where
#0 0x0000000000571360 in WriteCoreDump ()
#1 0x0000000000560772 in linux_sig_coredump ()
#2 <signal handler called>
#3 0x0000000000584748 in List<char>::Rewind ()
#4 0x000000000059dc01 in StringList::contains_anycase ()
#5 0x000000000050e906 in GahpServer::Initialize ()
#6 0x000000000050e80a in GahpClient::Initialize ()
#7 0x00000000004e299f in GT4Job::doEvaluateState ()
#8 0x0000000000570dd9 in TimerManager::Timeout ()
#9 0x0000000000552e47 in DaemonCore::Driver ()
#10 0x0000000000562f3c in main ()
As far as I can recall there was no obituary email or other automatic
notification of this event, presumably since gridmanager is not a
persistent daemon.
I am not certain, but I believe the following Gridmanager log is the
right one, since it is the only one with an entry on the same day
as the timestamp on the core file,
# ls -l GridmanagerLog.grimaldi
-rw-r--r-- 1 condor condor 704412 Jan 16 13:31 GridmanagerLog.grimaldi
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date of creation: Fri Feb 1 18:42:53 2008 (1201912975)
Date: Fri, 1 Feb 2008 16:46:39 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
CC: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>, Filippo
Grimaldi <grimaldi__AT__ligo.mit.edu>
Subject: Re: [condor-admin #17438] LIGO: condor_gridmanager coredump
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu
The log file and core image for this ticket may be found at,
http://www.ligo.caltech.edu/~anderson/condor.17438
Note, this has only happened once to the best of my knowledge.
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Fri Feb 1 18:46:55 2008 (1201913215)
Subject: Actions
Assigned to jfrey by roy
===========================================================================
Date of actions: Mon Feb 4 10:16:13 2008 (1202141773)
From: Jaime Frey <jfrey__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #17438] LIGO: condor_gridmanager coredump
Date: Mon, 4 Feb 2008 11:15:40 -0600
> The log file and core image for this ticket may be found at,
>
> http://www.ligo.caltech.edu/~anderson/condor.17438
Can you also post your condor_gridmanager binary? I'm having trouble
finding the same binary from our tarballs.
Also, which tarball or rpm did you install?
Thanks and regards,
Jaime Frey
UW-Madison Condor Team
===========================================================================
Date mail was appended: Mon Feb 4 11:15:51 2008 (1202145352)
Subject: Actions
Status changed from open to pending by jfrey
===========================================================================
Date of actions: Mon Feb 4 11:15:51 2008 (1202145354)
Date: Mon, 4 Feb 2008 09:54:43 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: skoranda__AT__gravity.phys.uwm.edu, grimaldi__AT__ligo.mit.edu, tannenba__AT__cs.wisc.edu
Subject: Re: [condor-admin #17438] LIGO: condor_gridmanager coredump
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu
On Mon, Feb 04, 2008 at 11:15:51AM -0600, condor-admin response tracking system wrote:
> > The log file and core image for this ticket may be found at,
> >
> > http://www.ligo.caltech.edu/~anderson/condor.17438
>
>
> Can you also post your condor_gridmanager binary? I'm having trouble
> finding the same binary from our tarballs.
I have posted it to the same URL.
>
> Also, which tarball or rpm did you install?
>
The RHEL3 x86_64 6.9.5 tarball.
# file condor_gridmanager
condor_gridmanager: ELF 64-bit LSB executable, AMD x86-64, version 1 (SYSV), for GNU/Linux 2.4.0, dynamically linked (uses shared libs), for GNU/Linux 2.4.0, stripped
# ident condor_gridmanager
condor_gridmanager:
$CondorVersion: 6.9.5 Nov 28 2007 BuildID: 65347 $
$CondorPlatform: X86_64-LINUX_RHEL3 $
$Id: kdb5_err.et 13854 2001-10-25 20:20:57Z tlyu $
$Id: krb5_err.et 16816 2004-10-13 16:18:27Z lxs $
$Id: accept_sec_context.c,v 1.30.4.2 2007/05/17 15:38:32 bester Exp $
$Id: acquire_cred.c,v 1.12.4.1 2007/05/17 15:38:32 bester Exp $
$Id: compare_name.c,v 1.22.4.2 2005/07/13 20:17:52 mlink Exp $
$Id: delete_sec_context.c,v 1.13 2005/04/15 23:37:16 meder Exp $
$Id: display_name.c,v 1.10 2005/04/15 23:37:16 meder Exp $
$Id: display_status.c,v 1.19 2005/04/15 23:37:16 meder Exp $
$Id: import_name.c,v 1.15 2005/04/15 23:37:18 meder Exp $
$Id: init_sec_context.c,v 1.31.4.4 2007/05/17 15:38:32 bester Exp $
$Id: inquire_cred.c,v 1.10 2005/04/15 23:37:19 meder Exp $
$Id: inquire_context.c,v 1.11 2005/04/15 23:37:19 meder Exp $
$Id: oid_functions.c,v 1.13 2005/04/15 23:37:19 meder Exp $
$Id: release_cred.c,v 1.5 2005/04/15 23:37:20 meder Exp $
$Id: release_name.c,v 1.6 2005/04/15 23:37:20 meder Exp $
$Id: unwrap.c,v 1.17 2005/04/15 23:37:20 meder Exp $
$Id: verify_mic.c,v 1.12 2005/04/15 23:37:21 meder Exp $
$Id: wrap.c,v 1.12 2005/04/15 23:37:21 meder Exp $
$Id: release_buffer.c,v 1.3 2005/04/15 23:37:20 meder Exp $
$Id: globus_i_gsi_gss_utils.c,v 1.38.4.1 2005/05/04 00:19:37 meder Exp $
$Id: import_cred.c,v 1.19.4.1 2007/05/17 15:38:32 bester Exp $
$Id: get_mic.c,v 1.7 2005/04/15 23:37:17 meder Exp $
$GCBVersion: 1.5.0 $
$GCBBuildDate: Nov 19 2007 $
However, we have now upgraded this particular condor pool to 7.0.0.
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Mon Feb 4 11:55:00 2008 (1202147701)
From: Jaime Frey <jfrey__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #17438] LIGO: condor_gridmanager coredump
Date: Mon, 4 Feb 2008 12:53:59 -0600
On Feb 4, 2008, at 11:55 AM, condor-admin response tracking system
wrote:
> On Mon, Feb 04, 2008 at 11:15:51AM -0600, condor-admin response
> tracking system wrote:
>>> The log file and core image for this ticket may be found at,
>>>
>>> http://www.ligo.caltech.edu/~anderson/condor.17438
>>
>>
>> Can you also post your condor_gridmanager binary? I'm having trouble
>> finding the same binary from our tarballs.
>
> I have posted it to the same URL.
>
>> Also, which tarball or rpm did you install?
>
> The RHEL3 x86_64 6.9.5 tarball.
>
>
> # file condor_gridmanager
> condor_gridmanager: ELF 64-bit LSB executable, AMD x86-64, version 1 (SYSV), for GNU/Linux 2.4.0, dynamically linked (uses shared libs), for GNU/Linux 2.4.0, stripped
>
> # ident condor_gridmanager
> condor_gridmanager:
> $CondorVersion: 6.9.5 Nov 28 2007 BuildID: 65347 $
> $CondorPlatform: X86_64-LINUX_RHEL3 $
> $Id: kdb5_err.et 13854 2001-10-25 20:20:57Z tlyu $
> $Id: krb5_err.et 16816 2004-10-13 16:18:27Z lxs $
> $Id: accept_sec_context.c,v 1.30.4.2 2007/05/17 15:38:32 bester Exp $
> $Id: acquire_cred.c,v 1.12.4.1 2007/05/17 15:38:32 bester Exp $
> $Id: compare_name.c,v 1.22.4.2 2005/07/13 20:17:52 mlink Exp $
> $Id: delete_sec_context.c,v 1.13 2005/04/15 23:37:16 meder Exp $
> $Id: display_name.c,v 1.10 2005/04/15 23:37:16 meder Exp $
> $Id: display_status.c,v 1.19 2005/04/15 23:37:16 meder Exp $
> $Id: import_name.c,v 1.15 2005/04/15 23:37:18 meder Exp $
> $Id: init_sec_context.c,v 1.31.4.4 2007/05/17 15:38:32 bester Exp $
> $Id: inquire_cred.c,v 1.10 2005/04/15 23:37:19 meder Exp $
> $Id: inquire_context.c,v 1.11 2005/04/15 23:37:19 meder Exp $
> $Id: oid_functions.c,v 1.13 2005/04/15 23:37:19 meder Exp $
> $Id: release_cred.c,v 1.5 2005/04/15 23:37:20 meder Exp $
> $Id: release_name.c,v 1.6 2005/04/15 23:37:20 meder Exp $
> $Id: unwrap.c,v 1.17 2005/04/15 23:37:20 meder Exp $
> $Id: verify_mic.c,v 1.12 2005/04/15 23:37:21 meder Exp $
> $Id: wrap.c,v 1.12 2005/04/15 23:37:21 meder Exp $
> $Id: release_buffer.c,v 1.3 2005/04/15 23:37:20 meder Exp $
> $Id: globus_i_gsi_gss_utils.c,v 1.38.4.1 2005/05/04 00:19:37 meder Exp $
> $Id: import_cred.c,v 1.19.4.1 2007/05/17 15:38:32 bester Exp $
> $Id: get_mic.c,v 1.7 2005/04/15 23:37:17 meder Exp $
> $GCBVersion: 1.5.0 $
> $GCBBuildDate: Nov 19 2007 $
>
>
>
> However, we have now upgraded this particular condor pool to 7.0.0.
This is very bizarre. The condor_gridmanager binary I pulled from our
website is different than the one you have. Also, the backtrace in the
core file is different. This is what I get (with either binary):
#0 0x0000000000571360 in WriteCoreDump (file_name=0x7fff7a28d270 "")
at src/coredumper.c:136
#1 0x0000000000560772 in linux_sig_coredump (signum=11)
at daemon_core_main.C:570
#2 0x00002add311182b0 in __libc_sigaction () from /lib64/libc.so.6
#3 0x00002add3131c680 in ?? ()
#4 0x00007fff7a294680 in ?? ()
#5 0x00000000004cc280 in ?? ()
#6 0x00007fff7a295950 in ?? ()
#7 0x0000000000000000 in ?? ()
#8 0x0000000000000000 in ?? ()
#9 0x00002add3115468e in malloc_get_state () from /lib64/libc.so.6
#10 0x00002add30d15f3e in operator delete () from /usr/lib64/libstdc++.so.5
#11 0x00002add30d15f79 in operator delete[] () from /usr/lib64/libstdc++.so.5
#12 0x00000000005d5d62 in passwd_cache::init_groups (this=Cannot access memory at address 0xfffffffffffffff8
)
Can you verify that the core file you provided matches the backtrace
you reported? Maybe it was overwritten by a later crash.
Thanks and regards,
Jaime Frey
UW-Madison Condor Team
===========================================================================
Date mail was appended: Mon Feb 4 12:54:17 2008 (1202151259)
Subject: Actions
Status changed from open to pending by jfrey
===========================================================================
Date of actions: Mon Feb 4 12:54:17 2008 (1202151260)
Date: Mon, 4 Feb 2008 11:21:03 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: skoranda__AT__gravity.phys.uwm.edu, grimaldi__AT__ligo.mit.edu, tannenba__AT__cs.wisc.edu
Subject: Re: [condor-admin #17438] LIGO: condor_gridmanager coredump
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu
On Mon, Feb 04, 2008 at 12:54:17PM -0600, condor-admin response tracking system wrote:
>
> This is very bizarre. The condor_gridmanager binary I pulled from our
> website is different than the one you have. Also, the backtrace in the
> core file is different. This is what I get (with either binary):
>
> #0 0x0000000000571360 in WriteCoreDump (file_name=0x7fff7a28d270 "")
> at src/coredumper.c:136
> #1 0x0000000000560772 in linux_sig_coredump (signum=11)
> at daemon_core_main.C:570
> #2 0x00002add311182b0 in __libc_sigaction () from /lib64/libc.so.6
> #3 0x00002add3131c680 in ?? ()
> #4 0x00007fff7a294680 in ?? ()
> #5 0x00000000004cc280 in ?? ()
> #6 0x00007fff7a295950 in ?? ()
> #7 0x0000000000000000 in ?? ()
> #8 0x0000000000000000 in ?? ()
> #9 0x00002add3115468e in malloc_get_state () from /lib64/libc.so.6
> #10 0x00002add30d15f3e in operator delete () from /usr/lib64/libstdc++.so.5
> #11 0x00002add30d15f79 in operator delete[] () from /usr/lib64/libstdc++.so.5
> #12 0x00000000005d5d62 in passwd_cache::init_groups (this=Cannot access memory at address 0xfffffffffffffff8
> )
>
> Can you verify that the core file you provided matches the backtrace
> you reported? Maybe it was overwritten by a later crash.
>
I copied the executable and core file from the URL and ran,
[root@ldas-grid tmp]# gunzip core.gz
[root@ldas-grid tmp]# gunzip condor_gridmanager.gz
[root@ldas-grid tmp]# gdb ./condor_gridmanager core
GNU gdb Red Hat Linux (6.3.0.0-1.84rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...(no debugging symbols found)
Using host libthread_db library "/lib64/libthread_db.so.1".
warning: core file may not match specified executable file.
Core was generated by `condor_gridmanager -f -C (Owner=?="grimaldi"&&JobUniverse==9) -o grimaldi -S /tm'.
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libcrypt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libcrypt.so.1
Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libresolv.so.2
Reading symbols from /usr/lib64/libstdc++.so.5...
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libstdc++.so.5
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libgcc_s.so.1
Reading symbols from /lib64/libc.so.6...
(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
#0 0x0000000000571360 in WriteCoreDump ()
(gdb) where
#0 0x0000000000571360 in WriteCoreDump ()
#1 0x0000000000560772 in linux_sig_coredump ()
#2 <signal handler called>
#3 0x0000000000584748 in List<char>::Rewind ()
#4 0x000000000059dc01 in StringList::contains_anycase ()
#5 0x000000000050e906 in GahpServer::Initialize ()
#6 0x000000000050e80a in GahpClient::Initialize ()
#7 0x00000000004e299f in GT4Job::doEvaluateState ()
#8 0x0000000000570dd9 in TimerManager::Timeout ()
#9 0x0000000000552e47 in DaemonCore::Driver ()
#10 0x0000000000562f3c in main ()
(gdb)
[root@ldas-grid tmp]# md5sum condor_gridmanager core
1655bf4f1704f799259673b03fb07b83 condor_gridmanager
164cc964e1296b7098d90b8201fc6f6a core
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
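
The checksum comparison above can be automated when both sides want to confirm they are debugging the same binary and core file. The following is a minimal Python sketch, not part of the original exchange. The two MD5 values are the ones reported in this message; the function names and the directory argument are hypothetical.

```python
import hashlib
import os

# Checksums reported in the ticket for the two posted files.
EXPECTED = {
    "condor_gridmanager": "1655bf4f1704f799259673b03fb07b83",
    "core": "164cc964e1296b7098d90b8201fc6f6a",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatched_files(directory, expected=EXPECTED):
    """Return the names that are missing or whose MD5 differs from expected."""
    bad = []
    for name, checksum in sorted(expected.items()):
        path = os.path.join(directory, name)
        if not os.path.isfile(path) or md5_of_file(path) != checksum:
            bad.append(name)
    return bad
```

Calling `mismatched_files("/tmp")` after unpacking the downloads into /tmp (an assumed location) returns an empty list when both files match the checksums above.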
===========================================================================
Date mail was appended: Mon Feb 4 13:21:37 2008 (1202152898)
From: Jaime Frey <jfrey__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #17438] LIGO: condor_gridmanager coredump
Date: Mon, 4 Feb 2008 15:15:37 -0600
On Feb 4, 2008, at 1:21 PM, condor-admin response tracking system wrote:
> On Mon, Feb 04, 2008 at 12:54:17PM -0600, condor-admin response
> tracking system wrote:
>>
>> This is very bizarre. The condor_gridmanager binary I pulled from our
>> website is different than the one you have. Also, the backtrace in the
>> core file is different. This is what I get (with either binary):
...
>>
>> Can you verify that the core file you provided matches the backtrace
>> you reported? Maybe it was overwritten by a later crash.
>
> I copied the executable and core file from the URL and ran,
...
What version of linux are you running?
Thanks and regards,
Jaime Frey
UW-Madison Condor Team
===========================================================================
Date mail was appended: Mon Feb 4 15:15:46 2008 (1202159747)
Subject: Actions
Status changed from open to pending by jfrey
===========================================================================
Date of actions: Mon Feb 4 15:15:46 2008 (1202159748)
Date: Mon, 4 Feb 2008 13:56:49 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: skoranda__AT__gravity.phys.uwm.edu, grimaldi__AT__ligo.mit.edu, tannenba__AT__cs.wisc.edu
Subject: Re: [condor-admin #17438] LIGO: condor_gridmanager coredump
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu
On Mon, Feb 04, 2008 at 03:15:46PM -0600, condor-admin response tracking system wrote:
> On Feb 4, 2008, at 1:21 PM, condor-admin response tracking system wrote:
>
> > On Mon, Feb 04, 2008 at 12:54:17PM -0600, condor-admin response
> > tracking system wrote:
> >>
> >> This is very bizarre. The condor_gridmanager binary I pulled from our
> >> website is different than the one you have. Also, the backtrace in the
> >> core file is different. This is what I get (with either binary):
> ...
> >>
> >> Can you verify that the core file you provided matches the backtrace
> >> you reported? Maybe it was overwritten by a later crash.
> >
> > I copied the executable and core file from the URL and ran,
>
> ...
>
> What version of linux are you running?
>
[root@ldas-grid ~]# cat /etc/redhat-release
Fedora Core release 4 (Stentz)
[root@ldas-grid ~]# uname -a
Linux ldas-grid 2.6.20.20-CIT #1 SMP Mon Oct 1 13:12:15 PDT 2007 x86_64 x86_64 x86_64 GNU/Linux
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Mon Feb 4 15:57:22 2008 (1202162244)
From: Jaime Frey <jfrey__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #17438] LIGO: condor_gridmanager coredump
Date: Mon, 4 Feb 2008 19:09:21 -0600
I see what's causing the crash in the gridmanager and am working on a
fix.
Let's work on correcting the ultimate cause. One of the gridmanager's
gahp servers is failing to start properly. What type of jobs are you
submitting (e.g. gt2, gt4)?
Thanks and regards,
Jaime Frey
UW-Madison Condor Team
===========================================================================
Date mail was appended: Mon Feb 4 19:09:36 2008 (1202173778)
Subject: Actions
Status changed from open to pending by jfrey
===========================================================================
Date of actions: Mon Feb 4 19:09:36 2008 (1202173779)
Date: Mon, 4 Feb 2008 18:05:52 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: skoranda__AT__gravity.phys.uwm.edu, grimaldi__AT__ligo.mit.edu, tannenba__AT__cs.wisc.edu
Subject: Re: [condor-admin #17438] LIGO: condor_gridmanager coredump
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu
Filippo,
Can you answer this question?
Do you have a simple example script that is working for other sites?
You might also just try running the same script at Caltech again now
that we have upgraded to 7.0.0.
Thanks.
On Mon, Feb 04, 2008 at 07:09:36PM -0600, condor-admin response tracking system wrote:
> I see what's causing the crash in the gridmanager and am working on a
> fix.
>
> Let's work on correcting the ultimate cause. One of the gridmanager's
> gahp servers is failing to start properly. What type of jobs are you
> submitting (e.g. gt2, gt4)?
>
> Thanks and regards,
> Jaime Frey
> UW-Madison Condor Team
>
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Mon Feb 4 20:06:11 2008 (1202177173)
Date: Tue, 5 Feb 2008 09:58:55 -0500 (EST)
From: Filippo Grimaldi <grimaldi__AT__ligo.mit.edu>
To: Stuart Anderson <anderson__AT__ligo.caltech.edu>
CC: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>,
skoranda__AT__gravity.phys.uwm.edu, tannenba__AT__cs.wisc.edu
Subject: Re: [condor-admin #17438] LIGO: condor_gridmanager coredump
X-Seen-BY: mailfromd 4.1 granite.cs.wisc.edu
Hi,
the script is the easiest you can get:
Universe = grid
grid_resource = gt4 osg-ligo.mit.edu:9443 Condor
Executable = /bin/hostname
Notification = NEVER
Output = host4.out
Error = host4.err
Log = host4.log
Queue 1
and it still does not work at Caltech...
Filippo
On Mon, 4 Feb 2008, Stuart Anderson wrote:
> Filippo,
> Can you answer this question?
>
> Do you have a simple example script that is working for other sites?
> You might also just try running the same script at Caltech again now
> that we have upgraded to 7.0.0.
>
> Thanks.
>
>
> On Mon, Feb 04, 2008 at 07:09:36PM -0600, condor-admin response tracking system wrote:
>> I see what's causing the crash in the gridmanager and am working on a
>> fix.
>>
>> Let's work on correcting the ultimate cause. One of the gridmanager's
>> gahp servers is failing to start properly. What type of jobs are you
>> submitting (e.g. gt2, gt4)?
>>
>> Thanks and regards,
>> Jaime Frey
>> UW-Madison Condor Team
>>
>
> --
> Stuart Anderson anderson__AT__ligo.caltech.edu
> http://www.ligo.caltech.edu/~anderson
>
===========================================================================
Date mail was appended: Tue Feb 5 8:59:15 2008 (1202223556)
From: Jaime Frey <jfrey__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #17438] LIGO: condor_gridmanager coredump
Date: Tue, 5 Feb 2008 10:57:53 -0600
> the script is the easiest you can get:
>
> Universe = grid
> grid_resource = gt4 osg-ligo.mit.edu:9443 Condor
> Executable = /bin/hostname
> Notification = NEVER
> Output = host4.out
> Error = host4.err
> Log = host4.log
> Queue 1
>
> and it still does not work at Caltech...
The problem is likely caused by Condor not finding the right version
of java. Try running <condor>/sbin/gt4_gahp on the command line. It
should print the following line, then wait for input:
$GahpVersion: 1.7.0 Jun 18 2007 GT4\ GAHP\ (GT-4.0.3) $
If it prints something different, check the JAVA parameter in your
Condor configuration file.
Thanks and regards,
Jaime Frey
UW-Madison Condor Team
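
The check described above (start the GAHP and look at its first line of output) can be scripted. The following is a minimal Python sketch, not part of the original exchange. The banner pattern is inferred only from the single `$GahpVersion` line quoted above, and the helper names are hypothetical. If the GAHP prints nothing at all, the read blocks, so this is meant for interactive use.

```python
import re
import subprocess

# Pattern inferred from the one banner line quoted in the ticket:
#   $GahpVersion: 1.7.0 Jun 18 2007 GT4\ GAHP\ (GT-4.0.3) $
GAHP_BANNER = re.compile(r"^\$GahpVersion: (\S+) .* \$$")


def parse_gahp_banner(first_line):
    """Return the GAHP version string, or None if the line is not a banner."""
    match = GAHP_BANNER.match(first_line.strip())
    return match.group(1) if match else None


def probe_gahp(argv):
    """Start the GAHP, read its first output line, then stop it.

    Returns the version from the banner, or None if the program cannot be
    started or its first line is something else (e.g. a Java exception).
    """
    try:
        with subprocess.Popen(
            argv,
            stdin=subprocess.PIPE,     # keep stdin open so the GAHP waits
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # exceptions go to stderr; merge them
            text=True,
        ) as proc:
            first_line = proc.stdout.readline()
            proc.kill()
    except OSError:
        return None
    return parse_gahp_banner(first_line)
```

For example, `probe_gahp(["/usr/sbin/gt4_gahp"])` (the path is an assumption; substitute the sbin directory of the local Condor installation) returns the version string when the banner appears and `None` otherwise.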
===========================================================================
Date mail was appended: Tue Feb 5 10:58:04 2008 (1202230690)
Subject: Actions
Status changed from open to pending by jfrey
===========================================================================
Date of actions: Tue Feb 5 10:58:04 2008 (1202230691)
Date: Tue, 5 Feb 2008 09:38:29 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: skoranda__AT__gravity.phys.uwm.edu, grimaldi__AT__ligo.mit.edu, tannenba__AT__cs.wisc.edu
Subject: Re: [condor-admin #17438] LIGO: condor_gridmanager coredump
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu
On Tue, Feb 05, 2008 at 10:58:04AM -0600, condor-admin response tracking system wrote:
> > the script is the easiest you can get:
> >
> > Universe = grid
> > grid_resource = gt4 osg-ligo.mit.edu:9443 Condor
> > Executable = /bin/hostname
> > Notification = NEVER
> > Output = host4.out
> > Error = host4.err
> > Log = host4.log
> > Queue 1
> >
> > and it still does not work at Caltech...
>
>
> The problem is likely caused by Condor not finding the right version
> of java. Try running <condor>/sbin/gt4_gahp on the command line. It
> should print the following line, then wait for input:
>
> $GahpVersion: 1.7.0 Jun 18 2007 GT4\ GAHP\ (GT-4.0.3) $
>
> If it prints something different, check the JAVA parameter in your
> Condor configuration file.
>
That is indeed the problem. gt4_gahp does not work with the FC4 bundled
version of java (libgcj-4.0.2-8.fc4) and throws a highly illuminating
stack trace on startup:
$ /usr/sbin/gt4_gahp
Exception in thread "main" java.lang.NoClassDefFoundError: while resolving class: condor.gahp.Gahp
at java.lang.VMClassLoader.transformException(java.lang.Class, java.lang.Throwable) (/usr/lib64/libgcj.so.6.0.0)
at java.lang.VMClassLoader.resolveClass(java.lang.Class) (/usr/lib64/libgcj.so.6.0.0)
at java.lang.Class.initializeClass() (/usr/lib64/libgcj.so.6.0.0)
at java.lang.reflect.Method.invoke(java.lang.Object, java.lang.Object[]) (/usr/lib64/libgcj.so.6.0.0)
at org.globus.bootstrap.BootstrapBase.launch(java.lang.String, java.lang.String[]) (Unknown Source)
at org.globus.bootstrap.Bootstrap.main(java.lang.String[]) (Unknown Source)
at .main (/usr/lib64/libgij.so.6.0.0)
at .__libc_start_main (/lib64/libc-2.3.6.so)
Caused by: java.lang.ClassNotFoundException: java.lang.StringBuilder not found in gnu.gcj.runtime.SystemClassLoader{
....
Under these circumstances condor_gridmanager also dumps core in Condor 7.0.0,
so that is still worth cleaning up.
I just changed JAVA to point to the VDT distributed version,
$ java -version
java version "1.5.0_13"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_13-b05)
Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_13-b05, mixed mode)
and I now get,
$ gt4_gahp
$GahpVersion: 1.7.0 Jun 18 2007 GT4\ GAHP\ (GT-4.0.3) $
What versions of java are required for GT4 support in Condor?
Filippo,
Please try your test job again.
Scott,
On hydra.phys.uwm.edu I think Condor is pointed at the same VDT
version of java; however, even if I make this explicit I still get a
different error message:
[anderson@hydra condor]$ env _CONDOR_JAVA=/ldcg/ldg/vdt/jdk1.5/bin/java gt4_gahp
gt4_gahp_wrapper: execv failed, errno=2
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Tue Feb 5 11:38:45 2008 (1202233126)
From: Jaime Frey <jfrey__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #17438] LIGO: condor_gridmanager coredump
Date: Tue, 5 Feb 2008 11:49:43 -0600
> What versions of java are required for GT4 support in Condor?
Currently, java 1.5 is required. We're looking at making it work with
java 1.4 as well for future releases. The Globus 4.0 installation
guide says not to use gcj.
Thanks and regards,
Jaime Frey
UW-Madison Condor Team
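
The two constraints stated above (Java 1.5 or later, and not gcj) can be checked against the text printed by `java -version`. The following is a minimal Python sketch, not part of the original exchange. The function name is hypothetical, and treating any mention of "gcj" or "gij" in the output as a GNU runtime is a heuristic assumed here, not something Condor itself does.

```python
import re

MINIMUM_JAVA = (1, 5)  # from the ticket: "java 1.5 is required"
VERSION_LINE = re.compile(r'version "(\d+)\.(\d+)')


def java_is_suitable(version_output):
    """Judge `java -version` output; returns a tuple (ok, reason)."""
    lowered = version_output.lower()
    # Assumed heuristic: a GNU runtime identifies itself as gcj or gij.
    if any(marker in lowered for marker in ("gcj", "gij")):
        return False, "GNU gcj/gij runtime"
    match = VERSION_LINE.search(version_output)
    if match is None:
        return False, "no version string found"
    version = (int(match.group(1)), int(match.group(2)))
    if version < MINIMUM_JAVA:
        return False, "Java %d.%d is older than 1.5" % version
    return True, "Java %d.%d" % version
```

Note that `java -version` writes to standard error, so the text has to be captured from there before it is passed to `java_is_suitable`.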
===========================================================================
Date mail was appended: Tue Feb 5 11:49:53 2008 (1202233794)
Subject: Actions
Status changed from open to pending by jfrey
===========================================================================
Date of actions: Tue Feb 5 11:49:53 2008 (1202233796)
Date: Tue, 5 Feb 2008 20:02:39 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: skoranda__AT__gravity.phys.uwm.edu, grimaldi__AT__ligo.mit.edu, tannenba__AT__cs.wisc.edu
Subject: Re: [condor-admin #17438] LIGO: condor_gridmanager coredump
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu
On Tue, Feb 05, 2008 at 11:49:53AM -0600, condor-admin response tracking system wrote:
> > What versions of java are required for GT4 support in Condor?
>
> Currently, java 1.5 is required. We're looking at making it work with
> java 1.4 as well for future releases. The Globus 4.0 installation
> guide says not to use gcj.
>
Thanks. You might want to add some of these constraints to the Condor manual,
or possibly a link to the GT manual, for other naive java users like myself.
What I did find in the Condor GT4 Grid Type section,
http://www.cs.wisc.edu/condor/manual/v7.0/5_3Grid_Universe.html#33446
is a requirement for Java 1.4.2 or higher.
As a meta problem, I would appreciate it if you would look at,
http://www.cs.wisc.edu/condor/ligo-tickets/index-unresolved.html
to see why the "Last Updated" timestamp is stuck at,
"Sat Feb 2 13:34:34 CST 2008"
and why this problem ticket, at least as seen via the link from this
URL, is not showing this recent thread of emails that identified
java as the main problem.
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Tue Feb 5 22:03:02 2008 (1202270587)
From: Jaime Frey <jfrey__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #17438] LIGO: condor_gridmanager coredump
Date: Wed, 6 Feb 2008 12:27:31 -0600
On Feb 5, 2008, at 10:03 PM, condor-admin response tracking system
wrote:
> On Tue, Feb 05, 2008 at 11:49:53AM -0600, condor-admin response
> tracking system wrote:
>>> What versions of java are required for GT4 support in Condor?
>>
>> Currently, java 1.5 is required. We're looking at making it work with
>> java 1.4 as well for future releases. The Globus 4.0 installation
>> guide says not to use gcj.
>
> Thanks. You might want to add some of these constraints to the Condor manual,
> or possibly a link to the GT manual, for other naive java users like myself.
> What I did find in the Condor GT4 Grid Type section,
> http://www.cs.wisc.edu/condor/manual/v7.0/5_3Grid_Universe.html#33446
> is a requirement for Java 1.4.2 or higher.
That should definitely be documented.
> As a meta problem, I would appreciate it if you would look at,
>
> http://www.cs.wisc.edu/condor/ligo-tickets/index-unresolved.html
> to see why the "Last Updated" timestamp is stuck at,
> "Sat Feb 2 13:34:34 CST 2008"
> and why this problem ticket, at least as seen via the link from this
> URL, is not showing this recent thread of emails that identified
> java as the main problem.
It appears the 'Last Modified' column on the summary page is actually
the date the ticket was created. I can bug the maintainer to make it
consistent. I assume you would prefer it to show the last modified
time, rather than changing the heading to match what's currently
displayed.
I see the full thread under the link for this ticket. I suspect the
pages are updated once or twice a day and the latest emails arrived
after the last update when you looked at it.
Thanks and regards,
Jaime Frey
UW-Madison Condor Team
===========================================================================
Date mail was appended: Wed Feb 6 12:27:42 2008 (1202322463)
Subject: Actions
Status changed from open to pending by jfrey
===========================================================================
Date of actions: Wed Feb 6 12:27:42 2008 (1202322464)
Subject: Comments added
This ticket may be moot in the near future when LIGO upgrades to CentOS 5.
Comments added by tannenba
===========================================================================
Date comments were added: Fri Jul 18 13:44:18 2008 (1216406658)
CC: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>, Todd Tannenbaum
<tannenba__AT__cs.wisc.edu>
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #17438] LIGO: condor_gridmanager coredump
Date: Fri, 2 Jan 2009 13:55:53 -0800
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu
On Feb 6, 2008, at 10:27 AM, condor-admin response tracking system
wrote:
> On Feb 5, 2008, at 10:03 PM, condor-admin response tracking system
> wrote:
>
>> On Tue, Feb 05, 2008 at 11:49:53AM -0600, condor-admin response
>> tracking system wrote:
>>>> What versions of java are required for GT4 support in Condor?
>>>
>>> Currently, java 1.5 is required. We're looking at making it work with
>>> java 1.4 as well for future releases. The Globus 4.0 installation
>>> guide says not to use gcj.
>>
>> Thanks. You might want to add some of these constraints to the Condor manual,
>> or possibly a link to the GT manual, for other naive java users like myself.
>> What I did find in the Condor GT4 Grid Type section,
>> http://www.cs.wisc.edu/condor/manual/v7.0/5_3Grid_Universe.html#33446
>> is a requirement for Java 1.4.2 or higher.
>
> That should definitely be documented.
The Condor 7.2.0 documentation still claims that Java 1.4.2 or higher
is required, e.g.,
http://www.cs.wisc.edu/condor/manual/v7.2/5_3Grid_Universe.html#SECTION00632300000000000000
Is this now a true statement, i.e., was the backport to 1.4.2 added,
or is this still a documentation bug and 1.5 is still required?
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Fri Jan 2 15:57:04 2009 (1230933424)
From: Jaime Frey <jfrey__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #17438] LIGO: condor_gridmanager coredump
Date: Mon, 12 Jan 2009 14:47:47 -0600
> The Condor 7.2.0 documentation still claims that Java 1.4.2 or higher
> is required, e.g.,
> http://www.cs.wisc.edu/condor/manual/v7.2/5_3Grid_Universe.html#SECTION00632300000000000000
>
> Is this now a true statement, i.e., was the backport to 1.4.2 added,
> or is this still a documentation bug and 1.5 is still required?
That is a documentation bug. I'm fixing it now. Sorry for the
continued confusion.
Thanks and regards,
Jaime Frey
UW-Madison Condor Team
===========================================================================
Date mail was appended: Mon Jan 12 14:47:50 2009 (1231793270)
Subject: Actions
Status changed from open to pending by jfrey
===========================================================================
Date of actions: Mon Jan 12 14:47:50 2009 (1231793271)